[Fix] Fix pass_at_k missing for SkRLGymGenerator.agent_loop flow due to token-level rewards #271

Merged
tyler-griggs merged 4 commits into NovaSky-AI:main from CharlieFRuan:fix-0908-fix-reward on Sep 11, 2025
Conversation

@CharlieFRuan (Collaborator)

In #226, we started building per-token reward for the agent_loop() codepath to enable per-step reward. However, in get_metrics_from_generator_output(), we do not compute pass_at_n for token-level rewards:

```python
def get_metrics_from_generator_output(
    generator_output: GeneratorOutput, uids: List[str]
) -> Tuple[float, Optional[float]]:
    ...
    if isinstance(rewards[0], list):
        # We just compute mean over sequence reward.
        # TODO: We should make metrics customizable by the environment
        mean_raw_reward = float(np.mean([sum(seq_rewards) for seq_rewards in rewards]))
        pass_at_n = None  # not computed for token-level rewards since it's ill-defined
    else:
        ...
```

This PR resolves the issue by falling back to the per-trajectory reward when all intermediate rewards are None.
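The fallback described above can be sketched as follows. This is a hypothetical illustration (the helper name `collapse_reward` is not from the codebase); the actual logic lives in `skyrl_gym_generator.py`:

```python
from typing import List, Optional, Union

# Hypothetical sketch: collapse token-level rewards back to a single
# per-trajectory float when no intermediate (per-turn) reward was set,
# so that downstream metrics such as pass_at_n can still be computed.
def collapse_reward(
    final_reward: float, intermediate_rewards: List[Optional[float]]
) -> Union[float, List[float]]:
    if all(r is None for r in intermediate_rewards):
        # No per-step signal: report one scalar for the whole trajectory.
        return final_reward
    # Otherwise keep per-step rewards, substituting 0.0 for missing turns.
    return [0.0 if r is None else r for r in intermediate_rewards] + [final_reward]

print(collapse_reward(1.0, [None, None]))  # 1.0 (scalar fallback)
print(collapse_reward(1.0, [None, 0.5]))   # [0.0, 0.5, 1.0] (per-step kept)
```

With the scalar fallback, `get_metrics_from_generator_output()` takes the `else` branch and `pass_at_n` is computed normally.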

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request addresses an issue where pass_at_k metrics were not being calculated for the agent_loop flow due to token-level rewards. The introduced logic correctly distinguishes between per-trajectory and per-token reward scenarios, ensuring that rewards are formatted as a float when appropriate, which allows for the pass_at_k calculation. The implementation is sound and effectively resolves the problem. I've included one suggestion to refactor a loop into a more concise and readable single-line expression.

CharlieFRuan and others added 3 commits September 8, 2025 23:54
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@CharlieFRuan (Collaborator, Author)

/gemini review

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request effectively addresses the issue of pass_at_k metrics not being computed for certain reward structures. The change in skyrl_train/generators/skyrl_gym_generator.py correctly distinguishes between per-trajectory and per-token rewards by checking if intermediate rewards are None. This ensures that per-trajectory rewards are passed as a single float, enabling the metric calculation. The updated tests in skyrl-train/tests/cpu/generators/test_skyrl_gym_generator.py are well-designed, using parameterization to validate both scenarios and confirming the fix. The code is clear, correct, and well-tested.

@tyler-griggs tyler-griggs merged commit 3ffaa10 into NovaSky-AI:main Sep 11, 2025
3 checks passed
erictang000 pushed a commit that referenced this pull request Sep 16, 2025
…low due to token-level rewards" (#300)

Reverts #271 as it causes errors

For more, see
#299 (comment)
CharlieFRuan added a commit that referenced this pull request Sep 22, 2025
Fixes #311.

### PRs around this issue
- `pass_at_n` was no longer computed for multi-turn rollouts after #226
- #271 fixed this by introducing a `None` reward, which is ill-defined and was later reverted

### This PR
- Treats the last turn's reward as the entire trajectory's reward (a value > 0 signifies a "pass") for the purpose of computing `pass@N`
- Adds documentation about (per-turn) rewards, metrics, and per-token reward conversion (for better intuition) in `Creating a New Environment or Task`, for lack of a better place to put it
- Adds a unit test and more documentation to the metric util
- Removes the `Optional[float]` annotation in `skyrl_gym_generator.py`, since our `BaseTextEnvStepOutput.reward` is `float`, not `Optional[float]`. Also adds documentation stating that environments should return `0.0` as the reward for intermediate turns when not using turn-level rewards
- Also adds a minor fix to the `pass_at_n` computation so that negative rewards are taken into account. See this comment for more: #317 (comment)
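The `pass@N` convention described above (a prompt "passes" if any of its N trajectories has reward > 0, so negative rewards do not count as passes) can be sketched as follows. This is an illustrative re-implementation, not the code from the metric util:

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical sketch of pass@N over grouped rollouts: each uid groups
# the N trajectories sampled for one prompt; the prompt passes if any
# trajectory's reward is strictly positive.
def pass_at_n(uids: List[str], rewards: List[float]) -> float:
    passed: Dict[str, bool] = defaultdict(bool)
    for uid, reward in zip(uids, rewards):
        passed[uid] = passed[uid] or (reward > 0)
    return sum(passed.values()) / len(passed)

# Two prompts with two rollouts each: "a" passes once; "b" never does
# (a negative reward is not a pass), so pass@2 = 0.5.
print(pass_at_n(["a", "a", "b", "b"], [0.0, 1.0, -1.0, 0.0]))  # 0.5
```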

### Test
- Ran `run_gsm8k.sh` with:
  - orange: `batched=True` (previously working already, since it does not use the `agent_loop()` codepath that converts to per-token rewards)
  - green: `batched=False` (the codepath where `pass_at_n` was not computed prior to this PR)
  - grey: baseline from a previous stable PR's run
<img width="1101" height="574" alt="image"
src="https://github.com/user-attachments/assets/eca0ddae-8c64-457f-af49-a2cd4aaeb2f7"
/>


### Rendered doc
<img width="1116" height="933" alt="image"
src="https://github.com/user-attachments/assets/ecf1e58e-3d49-4251-9cd4-76fe59c758f0"
/>
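For intuition, the per-token reward conversion the documentation covers can be sketched as below. This assumes the common convention of placing each turn's reward on the last token of that turn's response (the function name and shapes are illustrative, not the actual implementation):

```python
from typing import List

# Hypothetical sketch: spread turn-level rewards onto response tokens by
# assigning each turn's reward to the final token of its response and
# 0.0 to every other token.
def turn_to_token_rewards(
    turn_rewards: List[float], turn_lengths: List[int]
) -> List[float]:
    token_rewards: List[float] = []
    for reward, length in zip(turn_rewards, turn_lengths):
        token_rewards.extend([0.0] * (length - 1) + [reward])
    return token_rewards

# Two turns with 3 and 2 response tokens, reward only on the final turn.
print(turn_to_token_rewards([0.0, 1.0], [3, 2]))  # [0.0, 0.0, 0.0, 0.0, 1.0]
```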

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Griggs <131809874+tyler-griggs@users.noreply.github.com>
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request Sep 28, 2025
…low due to token-level rewards" (#300)

ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request Sep 28, 2025
Fixes NovaSky-AI/SkyRL#311.

SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request Jan 3, 2026
…low due to token-level rewards" (#300)

SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request Jan 3, 2026
Fixes NovaSky-AI/SkyRL#311.

dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
…to token-level rewards (NovaSky-AI#271)

dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
…low due to token-level rewards" (NovaSky-AI#300)

dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
…aSky-AI#317)