-
Notifications
You must be signed in to change notification settings - Fork 8.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New Step API with terminated, truncated bools instead of done #2752
Conversation
Old API - `done=True` if episode ends in any way. New API - `terminated=True` if environment terminates (eg. due to task completion, failure etc.) `truncated=True` if episode truncates due to a time limit Changes 1. All existing environments are changed to new API without direct support for old API 2. Vector envs are changed to new API without direct support for old API 3. All wrappers (Except TimeLimit, OrderEnforcing) are changed to new API without direct support for old API 4. TimeLimit, OrderEnforcing and wrappers which don't change the step function support both APIs. StepCompatibility Wrapper 1. This wrapper is added to support conversion from old to new API and vice versa. 2. Takes `return_two_dones` argument in __init__. `True` (new API) by default. 3. Wrapper applied at make with `return_two_dones=True` by default. It can be changed during make like `env.make("CartPole-v1", return_two_dones=False)` StepCompatibilityVector Wrapper - Transforms vector environment to old API. Set `return_two_dones=False`. Misc 1. Autoreset bug fixed (hacky) by setting local variable instead of editing self.autoreset in Env.Spec.
Im thinking about the backward compatibility of this PR, in particular for online tutorials and uni courses, etc |
@pseudo-rnd-thoughts Yes it will break code. But my point to making new API the default is just to make the code a lot simpler, with old API still supported for some (but not all) features. Most wrappers and vector code only support new API. It would get way too convoluted for all that to support both APIs. We'd need to support any code that could've arbitrarily applied any wrapper or vector code. And for the code breaking, the fix is to apply the compatibility wrappers manually (in places where it cant be toggled with |
@arjun-kg So I understand your perspective that we should try to not support one done returns and I agree with the idea. But this is a massively backward compatibility breaking change that I think we should avoid until gym 1.0 The additional issue I see with having two done returns as default is that as no warning is raised to alert users to the change then code will start failing with no obvious reason without them looking at the change logs which I suspect a majority of people are not going to do. Therefore, we are going to get a massive increase in issues around this PR. |
@pseudo-rnd-thoughts Oh that makes sense. We could have old api as default with warning saying wrappers and vector api have been upgraded and will fail unless new api is used, and maybe guide them to a doc to see how to switch api (or use compatibility wrapper) for whatever is not available in the old api. So yes like you said, it won't be an error without warning. |
Thanks, could you add a test that checks that without |
@pseudo-rnd-thoughts I've made the changes. All the tests assume new api, so argument |
@arjun-kg Thanks for doing that, the workflows haven't happened but I imagine that we are going to get a massive number of deprecation warnings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My single biggest concern is the approach taken to backwards compatibility, which makes a lot of the code needlessly complex. I'm of the moderately strong opinion that all the code inside of gym after this PR is merged, should explicitly only support the new step semantic.
Since we definitely want environments written with obs, reward, done, info
to still work, we can have a wrapper doing something like
class OldToNew:
...
def step(self, action: ActType) -> Tuple[ObsType, float, bool, bool, dict]:
outputs = self.env.step(action)
if len(outputs) == 4:
obs, reward, done, info = outputs
return obs, reward, done, False, info
elif len(outputs) == 5:
# We applied the wrapper needlessly
return outputs
else:
raise ValueError("You messed something up real hard")
class NewToOld:
...
def step(self, action: ActType) -> Tuple[ObsType, float, bool, bool, dict]:
outputs = self.env.step(action)
if len(outputs) == 4:
# We applied the wrapper needlessly
return outputs
elif len(outputs) == 5:
obs, reward, terminated, truncated, info = outputs
return obs, reward, terminated or truncated, info
else:
raise ValueError("You messed something up real hard")
Then we apply this as appropriate, giving the users some control over what will be applied, and all the code should be compatible both ways as long as environments weren't trying to do their own truncations and just used TimeLimitWrapper instead.
A bunch of individual code comments are about this exact issue before I realized that's like the only big thing I have to comment about here.
def step( | ||
self, action: ActType | ||
) -> Union[ | ||
Tuple[ObsType, float, bool, bool, dict], Tuple[ObsType, float, bool, dict] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if I like this approach to backwards compatibility. If this is the official state of (for example) 0.24.0, then you can't reliably write an algorithm that will work for all valid 0.24.0 environments. I think we should just say that an environment should have the signature of (ObsType, float, bool, bool, dict), and then provide a wrapper-like compatibility layer that can convert an old-style environment to a new-style environment.
gym/core.py
Outdated
@@ -76,13 +81,17 @@ def step(self, action: ActType) -> Tuple[ObsType, float, bool, dict]: | |||
Returns: | |||
observation (object): agent's observation of the current environment. This will be an element of the environment's :attr:`observation_space`. This may, for instance, be a numpy array containing the positions and velocities of certain objects. | |||
reward (float) : amount of reward returned after previous action | |||
done (bool): whether the episode has ended, in which case further :meth:`step` calls will return undefined results. A done signal may be emitted for different reasons: Maybe the task underlying the environment was solved successfully, a certain timelimit was exceeded, or the physics simulation has entered an invalid state. ``info`` may contain additional information regarding the reason for a ``done`` signal. | |||
terminated (bool): whether the episode has ended due to a termination, in which case further step() calls will return undefined results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rephrase "termination" to something like "reaching a terminal state", or otherwise to indicate that it's about the intrinsic properties of the environment
gym/envs/registration.py
Outdated
@@ -96,6 +96,7 @@ class EnvSpec: | |||
max_episode_steps: Optional[int] = field(default=None) | |||
order_enforce: bool = field(default=True) | |||
autoreset: bool = field(default=False) | |||
return_two_dones: bool = field(default=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will actually be useful. It only really has any value if people are explicitly setting it, which is a new (and very temporary) feature, and we explicitly deprecate one of the settings the moment it's introduced. This goes back to the general comment that I think the library should only support one step semantic, and compatibilities can be done through a wrapper-like thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean a wrapper that's always applied at make is preferable to an argument? Or no wrapper at make? My point was to have this to allow backward compatibility and, if users so chose to use the new api, they could just switch it to True at make.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I just find it confusing as to what this is supposed to do. Does this indicate what the environment intrinsically returns? Does it decide whether a wrapper should be applied?
I'd prefer a wrapper that's always applied, unless explicitly turned off, which defaults to a no-op if the environment already follows the new API
gym/wrappers/step_compatibility.py
Outdated
from gym import logger | ||
|
||
|
||
class StepCompatibility(gym.Wrapper): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This desperately needs some comments and docstrings
gym/wrappers/step_compatibility.py
Outdated
self._return_two_dones = return_two_dones | ||
if not self._return_two_dones: | ||
logger.deprecation( | ||
"[StepAPI] Initializing environment in old step API which returns one bool instead of two. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The [StepAPI] thing implies there's a syntax like that in gym overall, which is not the case as far as I know, so I think it can be removed
@RedTachyon I believe the |
Good point. The thing is that all the internal code supports both old and new APIs, and I think we should stick to just one, and old-style environments will be "projected" into the new API and processed as such. I don't think there should be code in gym (outside of the explicit compatibility layer, and maybe some tests) which checks |
Right now, only core and TimeLimit has this kind of check. The only other place where there are details about compatibility is in registration and vector init, applying the compatibility wrapper, and giving the user an argument to toggle between new and old api at make. Everything else is switched cleanly to the new API. |
I'd remove it from core at least in that case. Probably also from TimeLimit, and then the compatibility wrapper would be applied as the very first wrapper, so that TimeLimit only has to ever operate on new-API envs |
Or can you maybe explain what is the exact workflow in these situations? The name
|
@RedTachyon
The reason I had TimeLimit be compatible with both APIs was for it to work smoothly in all the above scenarios. The above envs would first be wrapped by Would it be clearer if the name of the argument was |
From what I'm seeing now, the deprecation warning pops up even if I create an env with the new API, it seems to be due to some wrapper issue |
Just two questions because I think I misunderstood one thing during a call. If there's a third-party wrapper written in the old API (i.e. no mentions of new_step_api whatsoever), and it is applied to an old-API environment, everything will work alright? (except for a warning) If a third-party wrapper is written in old API, can we do: env = gym.make(something, new_step_api=True)
env = OldStyleWrapper(env) ? Is there an automatic way to make it work? Or does it require an update to the wrapper? If the answer to the first question is "yes", then after the merge conflicts are resolved, I think this can be merged. I still have... many problems, but they all seem to be necessary for now, so let's hope we can rip it out soon-ish for 1.0 (the second question doesn't affect the approval, it's just for me to understand where we're standing because there will be issues raised about it) |
Yes. This was one of the reasons behind introducing
No, this will not work (unless the wrapper doesn't need to unpack step returns). There's no 'automatic way' to do this. But the change to the wrapper is not too bad to get it to work: class OldStyleWrapper:
...
def step(self, action):
o, r, d, i = step_api_compatibility(self.env.step(action), new_step_api=False)
...
return o, r, d, i |
LGTM |
openai#2752 incorrectly removes the obligated positional parameter. This bug fix corrected it.
I came across this bug in gym: openai/gym#2752 When time limit is triggered, the output is a bool instead of a dict like {'player_0': False, 'player_1': False} This produces the following error randomly. It appears sooner or later during the training. ~/.conda/envs/lux_ai_s2/lib/python3.7/site-packages/gym/wrappers/time_limit.py in step(self, action) 16 self._elapsed_steps is not None 17 ), "Cannot call env.step() before calling reset()" ---> 18 observation, reward, done, info = self.env.step(action) 19 self._elapsed_steps += 1 20 if self._elapsed_steps >= self._max_episode_steps: /tmp/ipykernel_13436/3060603165.py in step(self, action) 57 obs, _, done, info = self.env.step(action) 58 obs = obs[agent] ---> 59 done = done[agent] 60 # if type(done) == type({}): done = done[agent] 61 # elif type(done) == type(True): done = {agent: done, opp_agent: False} TypeError: 'bool' object is not subscriptable I sugest this patch to avoid that.
Description
step
method is changed to return five items instead of four.Old API -
done=True
if episode ends in any way.New API -
terminated=True
if environment terminates (eg. due to task completion, failure etc.)truncated=True
if episode truncates due to a time limit or a reason that is not defined as part of the task MDPLink to docs - Farama-Foundation/gym-docs#115 (To be updated with latest changes)
Changes
new_step_api=True
in constructor.new_step_api=True
).terminal_reward
,terminal_observation
etc. is replaced withfinal_reward
,final_observation
etc. The intention is to reserve the 'termination' word for only ifterminated=True
. (for some motivation, Sutton and Barto uses terminal states to specifically refer to special states whose values are 0, states at the end of the MDP. This is not true for a truncation where the value of the final state need not be 0. So the current usage ofterminal_obs
etc. would be incorrect if we adopt this definition)new_step_api
argument is used in different places. It's meaning is taken to be "whether this function / class should output step values in new API or not". Eg.self.new_step_api
in a wrapper signifies whether the wrapper's step method outputs items in new API (the wrapper itself might have been written in new or old API, but through compatibility code it will output according toself.new_step_api
)play.py
alone is retained in old API due to the difficulty in having it be compatible for both APIs simultaneously, and being slightly lower priority.StepAPICompatibility Wrapper
new_step_api
argument in__init__
. False (old API) by default.new_step_api=False
by default. It can be changed during make likegym.make("CartPole-v1", new_step_api=True)
. The order of wrappers applied at make is as follows - core env -> PassiveEnvChecker -> StepAPICompatibility -> other wrappersstep_api_compatibility function
This function is similar to the wrapper, it is used for backward compatibility in wrappers, vector envs. It is used at interfaces between env / wrapper / vector / outside code. Example usage,
TimeLimit
'TimeLimit.truncated'
key ininfo
means that truncation has occurred. The boolean value it is set to refers to whether the core environment has already ended. So,info['TimeLimit.truncated']=False
, means the core environment has already terminated. We can inferterminated=True
,truncated=True
from this case.terminated=done
andtruncated=False
as there is no better information available. If TimeLimit info is available, it accordingly sets the two bools.Backward Compatibility
The PR attempts to achieve almost complete backward compatibility. However, there are cases which haven't been included. Environments directly imported eg.
from gym.envs.classic_control import CartPoleEnv
would not be backward compatible as these are rewritten in new API.StepAPICompatibility
wrapper would need to be used manually in this case. Envs made throughgym.make
all default to old API. Vector and wrappers also default to old API. These should all continue to work without problems. But due to the scale of the change, bugs are expected.Warning Details
Warnings are raised at the following locations:
gym.Wrapper
constructor - warning raised ifself.new_step_api==False
. This means any wrapper that does not explicitly passnew_step_api=True
into super() will raise the warning sinceself.new_step_api=False
by default. This is taken care of by wrappers written inside gym. Third party wrappers will face a problem in a specific situation - if the wrapper is not impacted by step API. eg. a wrapper subclassing ActionWrapper. This would work without any change for both APIs, however to avoid the warning, they still need to passnew_step_api=True
into super(). The thinking is - "If your wrapper supports new step API, you need to passnew_step_api=True
to avoid the warning".PassiveEnvChecker
,passive_env_step_check
function - if step return has 4 items a warning is raised. This happens only once since this function is only run once after env initialization. SincePassiveEnvChecker
is wrapped first before step compatibility in make, this will raise a warning based on the core env implementation's API.gym.VectorEnv
constructor - warning raised ifself.new_step_api==False
.StepAPICompatibility
wrapper constructor - the wrapper that is applied by default at make. Ifnew_step_api=False
, a warning is raised. This is independent of whether the core env is implemented in new or old api and only depends on thenew_step_api
argument.Checklist:
pre-commit
checks withpre-commit run --all-files