🚀 Feature
Support Gym's new Truncation API from release 0.25 to disambiguate between true terminal states and truncated terminations.
Motivation
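In the Bellman equation, we have to back up with (reward + value function of the next state) for all but the terminal states of the MDP, as discussed in the release notes of Gym here and in section 3 of this paper. An episode that is merely truncated does not end in a terminal state, so its last transition should still bootstrap from the value of the next state.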
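As a rough illustration of the point above, here is a minimal sketch of a one-step backup target that drops the bootstrap term only on true termination. The function name `td_target`, its parameters, and the numbers are hypothetical and chosen only for this example; they are not taken from this repo or from Gym.

```python
def td_target(reward: float, next_value: float, terminated: bool, gamma: float = 0.99) -> float:
    """One-step backup target (hypothetical helper for illustration).

    Only a true terminal state removes the bootstrap term. A truncated
    episode is treated like any other non-terminal transition.
    """
    bootstrap = 0.0 if terminated else gamma * next_value
    return reward + bootstrap


# Truncation (e.g. a time limit): the next state still has value, so we bootstrap.
truncated_target = td_target(reward=1.0, next_value=10.0, terminated=False, gamma=0.5)

# True termination: there is no future value to back up.
terminal_target = td_target(reward=1.0, next_value=10.0, terminated=True, gamma=0.5)

print(truncated_target, terminal_target)
```

If both cases are collapsed into a single `done` flag, the truncated transition would receive the same target as the terminal one, which underestimates its value.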
However, many environments (and hence learning algorithms) do not distinguish between truncations of an infinite-horizon MDP, which are introduced to increase exploration, and true terminations; both are currently passed through the `done` signal.
To mitigate this, starting from Gym 0.25 the `step` function returns a `terminated` and a `truncated` bool, which makes it possible to distinguish between the two cases. This has been found to improve both asymptotic performance and stability with respect to the chosen episode truncation length, both of which seem valid reasons to include it in this repo.
For backward compatibility, one could check the number of values returned by the `step` function and map `terminated` to `done` during rollout collection. I would be willing to help with this under the guidance of someone more experienced.
More information can be found in this issue in the gym repo.
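The backward-compatibility idea described above could look roughly like the sketch below, which inspects the length of the tuple returned by `step`. The helper name `normalize_step` is hypothetical. For the old 4-tuple form it assumes the environment is wrapped so that a time-limit cut is flagged under the `"TimeLimit.truncated"` key of `info`; environments that do not set this key will have every `done` treated as a true termination.

```python
from typing import Any, Dict, Tuple


def normalize_step(result: tuple) -> Tuple[Any, float, bool, bool, Dict[str, Any]]:
    """Convert either step-return format to (obs, reward, terminated, truncated, info).

    Hypothetical helper, not part of Gym or of this repo.
    """
    if len(result) == 5:
        # New-style return: terminated and truncated are already separate.
        obs, reward, terminated, truncated, info = result
        return obs, reward, bool(terminated), bool(truncated), info
    if len(result) == 4:
        # Old-style return: a single done flag. Assumption: truncation, if any,
        # is reported through info["TimeLimit.truncated"].
        obs, reward, done, info = result
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = bool(done) and not truncated
        return obs, reward, terminated, truncated, info
    raise ValueError(f"Unexpected number of step return values: {len(result)}")


# Old-style result where the episode was cut by a time limit.
old_result = ("obs", 1.0, True, {"TimeLimit.truncated": True})
obs, reward, terminated, truncated, info = normalize_step(old_result)

# Code that still expects the old signal can rebuild it.
done = terminated or truncated
print(terminated, truncated, done)
```

During rollout collection, `terminated` would then decide whether to bootstrap, while `done` would only decide when to reset the environment.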