[SkyRL-Gym] Make SQL and Search envs return None for intermediate steps #299
erictang000 wants to merge 2 commits into NovaSky-AI:main
/gemini review
Code Review
This pull request correctly modifies the SQL and Search environments to return None for intermediate step rewards, aligning with the SkyRLGymGenerator's expectations. The change to handle episode truncation in SkyRLGymGenerator by setting a None reward to 0.0 is also a good addition. My review includes a few suggestions to correct the return type hints in the modified environment methods, which were missed, and a minor documentation consistency fix.
  else:
      # No reward for intermediate steps for Search tasks
-     return 0
+     return None
  else:
      # No reward for intermediate steps for SQL tasks
-     return 0
+     return None
  BaseTextEnvStepOutput containing:
  - observations: New messages from the environment
- - reward: Float reward for the action
+ - reward: Optional[Float] reward for the action, None if intermediate steps have no reward
For consistency with Python's type hinting syntax, it would be clearer to use Optional[float] instead of Optional[Float].
- - reward: Optional[Float] reward for the action, None if intermediate steps have no reward
+ - reward: Optional[float] reward for the action, None if intermediate steps have no reward
As discussed offline, we will revert #271 first to stop the error that Eric is observing. The error comes from SkyRL/skyrl-train/skyrl_train/generators/skyrl_gym_generator.py, lines 292 to 298 in 2e7aba9. Reverting #271 prevents the error at the cost of losing the pass_at_n metric, which we will follow up on with a fix.
…low due to token-level rewards" (#300): Reverts #271 as it causes errors. For more, see #299 (comment).
Closing and tracking the actual fix in #311.
Overview
After #226, the SkyRLGymGenerator expects turn-level rewards to be None if the env uses trajectory-level rewards. After #271, this causes issues when computing metrics for the SQL and Search envs, which returned 0 for intermediate steps. Setting the intermediate reward to None and making the reward an optional type fixes this.
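The convention can be illustrated with a minimal sketch. `step_reward` is a hypothetical helper written for this example, not the actual SQL or Search env code: intermediate steps yield `None`, and only the final step yields a numeric reward.

```python
from typing import Optional


def step_reward(done: bool, final_score: float) -> Optional[float]:
    """Hypothetical helper: reward for one env step under trajectory-level rewards."""
    if done:
        # Only the final step carries the trajectory's score.
        return final_score
    # Intermediate steps have no reward, signalled by None rather than 0.
    return None


# A three-turn episode whose last turn ends the trajectory.
rewards = [step_reward(done=(i == 2), final_score=1.0) for i in range(3)]
```

Returning `None` rather than `0` lets a consumer tell "no reward at this step" apart from "a reward of zero".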
Additionally, this handles the case in SkyRLGymGenerator where we exit the agent loop after hitting the max length, which leaves every reward as None. This is fixed by setting the last reward to 0.0.
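A rough sketch of the truncation handling, assuming a hypothetical `finalize_rewards` helper and a plain list of per-turn rewards (the real generator code may be structured differently):

```python
from typing import List, Optional


def finalize_rewards(rewards: List[Optional[float]]) -> List[Optional[float]]:
    """Hypothetical helper: if every per-turn reward is None, set the last one to 0.0."""
    if rewards and all(r is None for r in rewards):
        # The episode was cut off before the env produced a final reward.
        return rewards[:-1] + [0.0]
    # Otherwise return an unchanged copy.
    return list(rewards)
```

For example, `finalize_rewards([None, None])` gives `[None, 0.0]`, while a list that already ends in a numeric reward is returned unchanged.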