[SkyRL-Gym] Make SQL and Search envs return None for intermediate steps#299

Closed
erictang000 wants to merge 2 commits into NovaSky-AI:main from erictang000:make_rewards_optional_float

Conversation

@erictang000
Collaborator

@erictang000 erictang000 commented Sep 16, 2025

Overview

After #226, the SkyRLGymGenerator expects turn-level rewards to be None if the env uses trajectory-level rewards. After #271, this causes issues when computing metrics for the SQL and Search envs. Setting the intermediate reward to None and making the reward type optional fixes this.

Additionally, this PR handles the case in SkyRLGymGenerator where we exit the agent loop due to hitting max length, which results in all rewards being None; this is fixed by setting the last reward to 0.
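The truncation handling described above can be sketched as follows. This is a hedged illustration, not the actual generator code; the helper name finalize_rewards and its shape are assumptions:

```python
from typing import List, Optional

def finalize_rewards(rewards: List[Optional[float]]) -> List[Optional[float]]:
    # Hypothetical sketch: if the agent loop exits early because max
    # length was hit, every per-step reward can be None. Setting the
    # final step's reward to 0.0 ensures downstream reward aggregation
    # always sees at least one float.
    if rewards and all(r is None for r in rewards):
        rewards[-1] = 0.0
    return rewards
```

A trajectory that was cut off with rewards [None, None] would come out as [None, 0.0], while a trajectory that already has a terminal reward is left untouched.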

@erictang000 erictang000 marked this pull request as ready for review September 16, 2025 01:32
@CharlieFRuan
Collaborator

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly modifies the SQL and Search environments to return None for intermediate step rewards, aligning with the SkyRLGymGenerator's expectations. The change to handle episode truncation in SkyRLGymGenerator by setting a None reward to 0.0 is also a good addition. My review includes a few suggestions to correct the return type hints in the modified environment methods, which were missed, and a minor documentation consistency fix.

  else:
      # No reward for intermediate steps for Search tasks
-     return 0
+     return None

high

With this change, the function can now return None. Please update the function's return type hint on line 45 from float to Optional[float] to reflect this. Optional is already imported in this file.
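A minimal sketch of what the updated signature would look like. The function name and parameters here are hypothetical, chosen only to illustrate the Optional[float] annotation the review asks for:

```python
from typing import Optional

def step_reward(is_final_step: bool, answer_correct: bool) -> Optional[float]:
    # Final step: score the trajectory with a float reward.
    if is_final_step:
        return 1.0 if answer_correct else 0.0
    # Intermediate steps now return None instead of 0, so the return
    # type must be Optional[float] rather than float.
    return None
```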

  else:
      # No reward for intermediate steps for SQL tasks
-     return 0
+     return None

high

This function now returns None for intermediate steps. The return type hint on line 78 should be updated from float to Optional[float] to match this change. You'll need to add Optional to your imports from the typing module (e.g., from typing import Optional).

  BaseTextEnvStepOutput containing:
      - observations: New messages from the environment
-     - reward: Float reward for the action
+     - reward: Optional[Float] reward for the action, None if intermediate steps have no reward

medium

For consistency with Python's type hinting syntax, it would be clearer to use Optional[float] instead of Optional[Float].

Suggested change
- - reward: Optional[Float] reward for the action, None if intermediate steps have no reward
+ - reward: Optional[float] reward for the action, None if intermediate steps have no reward

@CharlieFRuan
Collaborator

As discussed offline, we will revert #271 first to avoid the error that Eric is observing.

The error is due to this line:

mean_raw_reward = float(np.mean([sum(seq_rewards) for seq_rewards in rewards]))

The issue is that sum(seq_rewards) fails because the entries of rewards are mixed: some seq_rewards are floats while others (e.g. rewards[0]) are lists. This happens because the intermediate reward for the SQL and Search envs is 0, so one trajectory's per-step rewards can be [0, 1] while another's are just [1]. The first goes into the else branch, the second into the if branch:

if all(reward is None for reward, _ in per_step_rewards[:-1]):
    # If all rewards besides the last one are None (i.e. per-trajectory reward), we keep it as a float
    reward_out = per_step_rewards[-1][0]
else:
    # Otherwise build token-level rewards placed at assistant turn boundaries
    token_level_rewards: List[float] = [0.0] * len(response_ids)
    for step_reward, idx in per_step_rewards:
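The failure mode can be reproduced in isolation. This is an illustrative repro, not the actual generator code; the rewards values are made up to show the mixed float/list shape:

```python
import numpy as np

# One trajectory took the else branch (list of token-level rewards),
# the other took the if branch (bare trajectory-level float).
rewards = [[0.0, 1.0], 1.0]

try:
    # sum() over the float entry raises, because a float is not iterable.
    mean_raw_reward = float(np.mean([sum(seq_rewards) for seq_rewards in rewards]))
except TypeError as err:
    print("TypeError:", err)
```

The comprehension fails on the second entry before np.mean is ever reached, which matches the error observed when some trajectories carry trajectory-level rewards and others carry token-level lists.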

Reverting #271 will prevent the error, at the cost of not seeing the pass_at_n metric. We will follow up with a fix.

@erictang000
Collaborator Author

closing and tracking actual fix in #311

ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request Sep 28, 2025
…low due to token-level rewards" (#300)

Reverts NovaSky-AI/SkyRL#271 as it causes errors

For more, see
NovaSky-AI/SkyRL#299 (comment)
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request Jan 3, 2026
…low due to token-level rewards" (#300)

Reverts NovaSky-AI/SkyRL#271 as it causes errors

For more, see
NovaSky-AI/SkyRL#299 (comment)
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
…low due to token-level rewards" (NovaSky-AI#300)

Reverts NovaSky-AI#271 as it causes errors

For more, see
NovaSky-AI#299 (comment)