Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Step API with terminated, truncated bools instead of done #2752

Merged
merged 52 commits into from
Jul 9, 2022

Conversation

arjun-kg
Copy link
Contributor

@arjun-kg arjun-kg commented Apr 14, 2022

Description

step method is changed to return five items instead of four.

Old API - done=True if episode ends in any way.

New API -
terminated=True if environment terminates (eg. due to task completion, failure etc.)
truncated=True if episode truncates due to a time limit or a reason that is not defined as part of the task MDP

Link to docs - Farama-Foundation/gym-docs#115 (To be updated with latest changes)

Changes

  1. All existing environment implementations are changed to new API without direct support for old API. However gym.make for any environment will default to old API through a compatibility wrapper.
  2. Vector env implementations are changed to new API, with backward compatibility for old API, defaulting to old API. New API can set by a newly added argument new_step_api=True in constructor.
  3. All wrapper implementations are changed to new API, and have backward compatibility and default to old API (can be switched to new API with new_step_api=True).
  4. Some changes in phrasing - terminal_reward, terminal_observation etc. is replaced with final_reward, final_observation etc. The intention is to reserve the 'termination' word for only if terminated=True. (for some motivation, Sutton and Barto uses terminal states to specifically refer to special states whose values are 0, states at the end of the MDP. This is not true for a truncation where the value of the final state need not be 0. So the current usage of terminal_obs etc. would be incorrect if we adopt this definition)
  5. All tests are continued to be performed for old API (since the default is old API for now). A single exception for when the test env is unwrapped and so the compatibility wrapper doesn't apply. Also, special tests are added just for testing new API.
  6. new_step_api argument is used in different places. It's meaning is taken to be "whether this function / class should output step values in new API or not". Eg. self.new_step_api in a wrapper signifies whether the wrapper's step method outputs items in new API (the wrapper itself might have been written in new or old API, but through compatibility code it will output according to self.new_step_api)
  7. play.py alone is retained in old API due to the difficulty in having it be compatible for both APIs simultaneously, and being slightly lower priority.

StepAPICompatibility Wrapper

  1. This wrapper is added to support conversion from old to new API and vice versa.
  2. Takes new_step_api argument in __init__. False (old API) by default.
  3. Wrapper applied at make with new_step_api=False by default. It can be changed during make like gym.make("CartPole-v1", new_step_api=True). The order of wrappers applied at make is as follows - core env -> PassiveEnvChecker -> StepAPICompatibility -> other wrappers

step_api_compatibility function

This function is similar to the wrapper, it is used for backward compatibility in wrappers, vector envs. It is used at interfaces between env / wrapper / vector / outside code. Example usage,

# wrapper's step method
def step(self, action):

    # here self.env.step is made to return in new API, since the wrapper is written in new API
    obs, rew, terminated, truncated, info = step_api_compatibility(self.env.step(action), new_step_api=True) 

    if terminated or truncated:
        print("Episode end")
    ### more wrapper code

    # here the wrapper is made to return in API specified by self.new_step_api, that is set to False by default, and can be changed according to the situation
    return step_api_compatibility((obs, rew, terminated, truncated, info), new_step_api=self.new_step_api) 

TimeLimit

  1. In the current implementation of the timelimit wrapper, existence of 'TimeLimit.truncated' key in info means that truncation has occurred. The boolean value it is set to refers to whether the core environment has already ended. So, info['TimeLimit.truncated']=False, means the core environment has already terminated. We can infer terminated=True, truncated=True from this case.
  2. To change old API to new, the compatibility function first checks info. If there is nothing in info, it returns terminated=done and truncated=False as there is no better information available. If TimeLimit info is available, it accordingly sets the two bools.

Backward Compatibility

The PR attempts to achieve almost complete backward compatibility. However, there are cases which haven't been included. Environments directly imported eg. from gym.envs.classic_control import CartPoleEnv would not be backward compatible as these are rewritten in new API. StepAPICompatibility wrapper would need to be used manually in this case. Envs made through gym.make all default to old API. Vector and wrappers also default to old API. These should all continue to work without problems. But due to the scale of the change, bugs are expected.

Warning Details

Warnings are raised at the following locations:

  1. gym.Wrapper constructor - warning raised if self.new_step_api==False. This means any wrapper that does not explicitly pass new_step_api=True into super() will raise the warning since self.new_step_api=False by default. This is taken care of by wrappers written inside gym. Third party wrappers will face a problem in a specific situation - if the wrapper is not impacted by step API. eg. a wrapper subclassing ActionWrapper. This would work without any change for both APIs, however to avoid the warning, they still need to pass new_step_api=True into super(). The thinking is - "If your wrapper supports new step API, you need to pass new_step_api=True to avoid the warning".
  2. PassiveEnvChecker, passive_env_step_check function - if step return has 4 items a warning is raised. This happens only once since this function is only run once after env initialization. Since PassiveEnvChecker is wrapped first before step compatibility in make, this will raise a warning based on the core env implementation's API.
  3. gym.VectorEnv constructor - warning raised if self.new_step_api==False.
  4. StepAPICompatibility wrapper constructor - the wrapper that is applied by default at make. If new_step_api=False, a warning is raised. This is independent of whether the core env is implemented in new or old api and only depends on the new_step_api argument.
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (need to update with latest changes)
  • My changes generate no new warnings (only intentional warnings)
  • I have added tests that prove my fix is effective or that my feature works (added two tests but maybe needs more to be comprehensive)
  • New and existing unit tests pass locally with my changes
  • Locally runs with atari / pybullet envs

Old API - `done=True` if episode ends in any way.

New API -
`terminated=True` if environment terminates (eg. due to task completion, failure etc.)
`truncated=True` if episode truncates due to a time limit

Changes
1. All existing environments are changed to new API without direct support for old API
2. Vector envs are changed to new API without direct support for old API
3. All wrappers (Except TimeLimit, OrderEnforcing) are changed to new API without direct support for old API
4. TimeLimit, OrderEnforcing and wrappers which don't change the step function support both APIs.

StepCompatibility Wrapper
1. This wrapper is added to support conversion from old to new API and vice versa.
2. Takes `return_two_dones` argument in __init__. `True` (new API) by default.
3. Wrapper applied at make with  `return_two_dones=True` by default. It can be changed during make like `env.make("CartPole-v1", return_two_dones=False)`

StepCompatibilityVector Wrapper - Transforms vector environment to old API. Set `return_two_dones=False`.

Misc
1. Autoreset bug fixed (hacky) by setting local variable instead of editing self.autoreset in Env.Spec.
gym/core.py Show resolved Hide resolved
gym/core.py Show resolved Hide resolved
@pseudo-rnd-thoughts
Copy link
Contributor

Im thinking about the backward compatibility of this PR, in particular for online tutorials and uni courses, etc
As return_two_dones=True is the default for all environments, will this not break a huge amount of code without a warning being raised as it is the expected value now.
Is it not better to have return_two_dones=False for all environments until 1.0 with the warnings being raised explaining to people to change across and that all wrappers have been modified to expect this now?

@arjun-kg
Copy link
Contributor Author

arjun-kg commented Apr 17, 2022

@pseudo-rnd-thoughts Yes it will break code. But my point to making new API the default is just to make the code a lot simpler, with old API still supported for some (but not all) features. Most wrappers and vector code only support new API. It would get way too convoluted for all that to support both APIs. We'd need to support any code that could've arbitrarily applied any wrapper or vector code.

And for the code breaking, the fix is to apply the compatibility wrappers manually (in places where it cant be toggled with return_two_dones) which is a little better than immediately having to rewrite the broken code but yes, definitely not ideal. Maybe it's worth the complicated code to retain complete compatibility, at this point I'm not sure which way is better and wouldn't mind either way based on a consensus.

@pseudo-rnd-thoughts
Copy link
Contributor

@arjun-kg So I understand your perspective that we should try to not support one done returns and I agree with the idea.

But this is a massively backward compatibility breaking change that I think we should avoid until gym 1.0
Personally, I think that the default should be one done return with a warning being raised that all wrappers have been modified to expect environments with two done returns. Plus details that this will be changed to two done returns at gym 1.0 plus the compatibility wrappers to change environments from one to two and vice versa.

The additional issue I see with having two done returns as default is that as no warning is raised to alert users to the change then code will start failing with no obvious reason without them looking at the change logs which I suspect a majority of people are not going to do. Therefore, we are going to get a massive increase in issues around this PR.

@arjun-kg
Copy link
Contributor Author

@pseudo-rnd-thoughts Oh that makes sense. We could have old api as default with warning saying wrappers and vector api have been upgraded and will fail unless new api is used, and maybe guide them to a doc to see how to switch api (or use compatibility wrapper) for whatever is not available in the old api. So yes like you said, it won't be an error without warning.

@pseudo-rnd-thoughts
Copy link
Contributor

Thanks, could you add a test that checks that without return_two_dones=True then the warning is raised.

@arjun-kg
Copy link
Contributor Author

@pseudo-rnd-thoughts I've made the changes. All the tests assume new api, so argument return_two_dones=True is passed for all env making in tests.

@pseudo-rnd-thoughts
Copy link
Contributor

@arjun-kg Thanks for doing that, the workflows haven't happened but I imagine that we are going to get a massive number of deprecation warnings.
If true, would you these be able to silence these warnings in the CI, you can pass a parameter to pytest to ignore certain warnings or add a @pytest.mark.silencewarnings (I believe) to all of the test functions. The first opinion is probably best.

Copy link
Contributor

@RedTachyon RedTachyon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My single biggest concern is the approach taken to backwards compatibility, which makes a lot of the code needlessly complex. I'm of the moderately strong opinion that all the code inside of gym after this PR is merged, should explicitly only support the new step semantic.

Since we definitely want environments written with obs, reward, done, info to still work, we can have a wrapper doing something like

class OldToNew:
    ...
    def step(self, action: ActType) -> Tuple[ObsType, float, bool, bool, dict]:
        outputs = self.env.step(action)
        if len(outputs) == 4:
            obs, reward, done, info = outputs
            return obs, reward, done, False, info
        elif len(outputs) == 5:
            # We applied the wrapper needlessly
            return outputs
        else:
            raise ValueError("You messed something up real hard")
            
            
class NewToOld:
    ...
    def step(self, action: ActType) -> Tuple[ObsType, float, bool, bool, dict]:
        outputs = self.env.step(action)
        if len(outputs) == 4:
            # We applied the wrapper needlessly
            return outputs
        elif len(outputs) == 5:
            obs, reward, terminated, truncated, info = outputs
            return obs, reward, terminated or truncated, info
        else:
            raise ValueError("You messed something up real hard")

Then we apply this as appropriate, giving the users some control over what will be applied, and all the code should be compatible both ways as long as environments weren't trying to do their own truncations and just used TimeLimitWrapper instead.

A bunch of individual code comments are about this exact issue before I realized that's like the only big thing I have to comment about here.

def step(
self, action: ActType
) -> Union[
Tuple[ObsType, float, bool, bool, dict], Tuple[ObsType, float, bool, dict]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if I like this approach to backwards compatibility. If this is the official state of (for example) 0.24.0, then you can't reliably write an algorithm that will work for all valid 0.24.0 environments. I think we should just say that an environment should have the signature of (ObsType, float, bool, bool, dict), and then provide a wrapper-like compatibility layer that can convert an old-style environment to a new-style environment.

gym/core.py Outdated
@@ -76,13 +81,17 @@ def step(self, action: ActType) -> Tuple[ObsType, float, bool, dict]:
Returns:
observation (object): agent's observation of the current environment. This will be an element of the environment's :attr:`observation_space`. This may, for instance, be a numpy array containing the positions and velocities of certain objects.
reward (float) : amount of reward returned after previous action
done (bool): whether the episode has ended, in which case further :meth:`step` calls will return undefined results. A done signal may be emitted for different reasons: Maybe the task underlying the environment was solved successfully, a certain timelimit was exceeded, or the physics simulation has entered an invalid state. ``info`` may contain additional information regarding the reason for a ``done`` signal.
terminated (bool): whether the episode has ended due to a termination, in which case further step() calls will return undefined results
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rephrase "termination" to something like "reaching a terminal state", or otherwise to indicate that it's about the intrinsic properties of the environment

@@ -96,6 +96,7 @@ class EnvSpec:
max_episode_steps: Optional[int] = field(default=None)
order_enforce: bool = field(default=True)
autoreset: bool = field(default=False)
return_two_dones: bool = field(default=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will actually be useful. It only really has any value if people are explicitly setting it, which is a new (and very temporary) feature, and we explicitly deprecate one of the settings the moment it's introduced. This goes back to the general comment that I think the library should only support one step semantic, and compatibilities can be done through a wrapper-like thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean a wrapper that's always applied at make is preferable to an argument? Or no wrapper at make? My point was to have this to allow backward compatibility and, if users so chose to use the new api, they could just switch it to True at make.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I just find it confusing as to what this is supposed to do. Does this indicate what the environment intrinsically returns? Does it decide whether a wrapper should be applied?

I'd prefer a wrapper that's always applied, unless explicitly turned off, which defaults to a no-op if the environment already follows the new API

from gym import logger


class StepCompatibility(gym.Wrapper):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This desperately needs some comments and docstrings

self._return_two_dones = return_two_dones
if not self._return_two_dones:
logger.deprecation(
"[StepAPI] Initializing environment in old step API which returns one bool instead of two. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The [StepAPI] thing implies there's a syntax like that in gym overall, which is not the case as far as I know, so I think it can be removed

@pseudo-rnd-thoughts
Copy link
Contributor

@RedTachyon I believe the StepCompatibility wrapper in gym/wrappers/step_compatibility and gym/vector/step_compatibility are exactly what you describe as OldtoNew and NewtoOld
Currently, all environment created using .make use the backward compatibility wrapper with one done but raise a warning to users about the two done API and the changes made.
To use the two done API, user just add the return_two_done=True as a kwargs for .make

@RedTachyon
Copy link
Contributor

RedTachyon commented Apr 24, 2022

Good point. The thing is that all the internal code supports both old and new APIs, and I think we should stick to just one, and old-style environments will be "projected" into the new API and processed as such. I don't think there should be code in gym (outside of the explicit compatibility layer, and maybe some tests) which checks if len(outputs) == 4: ... else: ...

@arjun-kg
Copy link
Contributor Author

I don't think there should be code in gym (outside of the explicit compatibility layer, and maybe some tests) which checks if len(outputs) == 4: ... else: ...

Right now, only core and TimeLimit has this kind of check. The only other place where there are details about compatibility is in registration and vector init, applying the compatibility wrapper, and giving the user an argument to toggle between new and old api at make. Everything else is switched cleanly to the new API.

@RedTachyon
Copy link
Contributor

I'd remove it from core at least in that case. Probably also from TimeLimit, and then the compatibility wrapper would be applied as the very first wrapper, so that TimeLimit only has to ever operate on new-API envs

@RedTachyon
Copy link
Contributor

Or can you maybe explain what is the exact workflow in these situations? The name return_two_renders is very unclear to me.

  1. I'm calling gym.make("AntBulletEnv-v0") on an env written with the old API, and want it to follow the new API
  2. I'm calling the same thing, want it to follow the old API (is this possible?)
  3. I'm creating a brand well-written new env, I want it to follow the new API
  4. I'm creating the same thing, I want it to follow the old API (is this possible?)

@arjun-kg
Copy link
Contributor Author

arjun-kg commented Apr 25, 2022

@RedTachyon
So the compatibility wrapper now is always applied and switches to old api no matter what the style of the core env. return_two_dones essentially means whether the wrapped env is in new API or not (irrespective of the input env). The compatibility wrapper is always applied at make. and return_two_dones=False by default so no matter what env is sent in, it switches to old API by default unless return_two_dones=True is specified. For your questions, all four scenarios are possible -

  1. gym.make("AntBulletEnv-v0", return_two_dones=True) - the old api is explicitly converted to new api through the wrapper
  2. gym.make("AntBulletEnv-v0") - the wrapper is present but passes through without changes. This also ensures old api envs are backward compatible.
  3. gym.make("BrandNewAPIEnv-v0", return_two_dones=True) - the wrapper is present but passes through without changes
  4. gym.make("BrandNewAPIEnv-v0") - the new api is explicitly converted to old api through the wrapper. This ensures all the envs in gym that are in new api (eg. CartPole) are backward compatible. Codes already using CartPole will switch to the old API with a warning. Ofc the minor inconvenience is that, for people who completely switch to new API, they will still have to send the return_two_dones=True argument till 1.0.

The reason I had TimeLimit be compatible with both APIs was for it to work smoothly in all the above scenarios. The above envs would first be wrapped by StepCompatibility wrapper, then wrapped by order enforcing and then timelimit and work smoothly irrespective of how the environments were transformed.

Would it be clearer if the name of the argument was new_step_api instead of return_two_dones?

@RedTachyon
Copy link
Contributor

From what I'm seeing now, the deprecation warning pops up even if I create an env with the new API, it seems to be due to some wrapper issue
https://colab.research.google.com/drive/16TdzbKkGCRusKGSNSTUrnpRmqWsFoKBn?usp=sharing

@RedTachyon
Copy link
Contributor

RedTachyon commented Jul 6, 2022

Just two questions because I think I misunderstood one thing during a call. If there's a third-party wrapper written in the old API (i.e. no mentions of new_step_api whatsoever), and it is applied to an old-API environment, everything will work alright? (except for a warning)

If a third-party wrapper is written in old API, can we do:

env = gym.make(something, new_step_api=True)
env = OldStyleWrapper(env)

? Is there an automatic way to make it work? Or does it require an update to the wrapper?

If the answer to the first question is "yes", then after the merge conflicts are resolved, I think this can be merged. I still have... many problems, but they all seem to be necessary for now, so let's hope we can rip it out soon-ish for 1.0

(the second question doesn't affect the approval, it's just for me to understand where we're standing because there will be issues raised about it)

@arjun-kg
Copy link
Contributor Author

arjun-kg commented Jul 9, 2022

@RedTachyon

If there's a third-party wrapper written in the old API (i.e. no mentions of new_step_api whatsoever), and it is applied to an old-API environment, everything will work alright?

Yes. This was one of the reasons behind introducing new_step_api attribute in the base wrapper and having it default to False, to have this kind of backward compatibility.

? Is there an automatic way to make it work? Or does it require an update to the wrapper?

No, this will not work (unless the wrapper doesn't need to unpack step returns). There's no 'automatic way' to do this. But the change to the wrapper is not too bad to get it to work:

class OldStyleWrapper:
   ...
   def step(self, action):
        o, r, d, i = step_api_compatibility(self.env.step(action), new_step_api=False)
        ...
        return o, r, d, i

@pseudo-rnd-thoughts
Copy link
Contributor

LGTM

@jkterry1 jkterry1 merged commit 907b1b2 into openai:master Jul 9, 2022
ZhiqingXiao added a commit to ZhiqingXiao/gym that referenced this pull request Jul 17, 2022
openai#2752 incorrectly removes the obligated positional parameter. This bug fix corrected it.
jxtrbtk added a commit to jxtrbtk/Lux-Design-S2 that referenced this pull request Feb 27, 2023
I came across this bug in gym: openai/gym#2752
When time limit is triggered, the output is a bool instead of a dict like {'player_0': False, 'player_1': False}

This produces the following error randomly. It appears sooner or later during the training.

~/.conda/envs/lux_ai_s2/lib/python3.7/site-packages/gym/wrappers/time_limit.py in step(self, action)
     16             self._elapsed_steps is not None
     17         ), "Cannot call env.step() before calling reset()"
---> 18         observation, reward, done, info = self.env.step(action)
     19         self._elapsed_steps += 1
     20         if self._elapsed_steps >= self._max_episode_steps:

/tmp/ipykernel_13436/3060603165.py in step(self, action)
     57         obs, _, done, info = self.env.step(action)
     58         obs = obs[agent]
---> 59         done = done[agent]
     60 #         if type(done) == type({}): done = done[agent]
     61 #         elif type(done) == type(True): done = {agent: done, opp_agent: False}

TypeError: 'bool' object is not subscriptable

I sugest this patch to avoid that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants