Incomplete relabelling of trajectories in HER #96
Comments
I hope this makes sense. I did not propose a PR because it would break compatibility with existing environments.
That's a good point! That also makes a lot of sense as to why updating the termination condition would help. One thing that's ambiguous is whether or not the episode should end when the goal state is reached. Some authors assume that it should, while others assume that it shouldn't. For example, the HER code released by the authors doesn't seem to update the termination condition. At this point, I'm not sure I'd feel comfortable making this change for everyone, since it may not be the intended behavior and it requires changing the existing environment interface. For now I'll pin this issue so that it has more visibility, and feel free to make a PR as suggested above.
Thanks @vitchyr for this insightful answer.
Agreed. Indeed, the HER authors tested the algorithm on environments like the gym Fetch tasks, where an episode terminates only once the horizon is reached. Also, whether the reward is sparse or dense, they use rewards that are negative and converge to 0 as the state gets closer to the goal, which avoids blow-up issues such as the one I pointed out when the reward is positive near the goal.
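To make the two conventions concrete, here is a minimal sketch (my own illustration, not code from the thread; the `epsilon` threshold is just an example value):

```python
import numpy as np

def sparse_negative_reward(state, goal, epsilon=0.05):
    # HER-paper style: -1 away from the goal, 0 once it is reached.
    # Returns are bounded below by -1 / (1 - gamma), so Q-values stay bounded
    # even if the episode never terminates early.
    return 0.0 if np.linalg.norm(state - goal) < epsilon else -1.0

def sparse_positive_reward(state, goal, epsilon=0.05):
    # Convention from the issue: +1 at the goal, 0 elsewhere.
    # Without a terminal flag at the goal, bootstrapping keeps adding +1.
    return 1.0 if np.linalg.norm(state - goal) < epsilon else 0.0
```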
@rstrudel Did you find that the horizon termination made a difference in your experiments? (Only that part, not the -1 and 0 part)
@bycn For my application, which is neural motion planning, it did make a difference. In my experiments I got the best results by setting the terminal flag when the goal is reached and providing a positive reward in that case. It might be quite experiment-dependent, however.
Hi,
First, thanks a lot for releasing such a nice repository. I have been using it for a few months now and I appreciate that it is quite well written and most of the code is self-explanatory. I learned a lot just by using it.
I am using SAC-HER and ran into a lot of divergence issues, which I eventually fixed. One of the main problems came from the relabelling of samples in the buffer: https://github.com/vitchyr/rlkit/blob/90195b24604f513403e4d0fe94db372d16700523/rlkit/data_management/obs_dict_replay_buffer.py#L228-L239
Given a batch, a set of new rewards is computed according to the resampled goals. However, the terminals are not updated, even though some states might be terminal with respect to the new goal.
This matters in the Bellman update, where the terminals variable appears: https://github.com/vitchyr/rlkit/blob/90195b24604f513403e4d0fe94db372d16700523/rlkit/torch/sac/sac.py#L128
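The linked line computes a target of roughly the following form (a simplified sketch, not the exact rlkit code):

```python
# Simplified SAC Bellman target; when terminals == 1 the bootstrapped
# term is dropped, so the target cannot keep growing past the reward.
q_target = rewards + (1.0 - terminals) * discount * target_q_values
```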
If the reward is of the form `-distance(state, goal)` and an episode only finishes because the maximum path length is reached, then not updating the terminals has little impact, which may be why this bug passed silently. However, I am working with a sparse reward which is 1 if `distance(state, goal) < epsilon`, and in this case, if the terminals are not updated, the Q-function blows up. Indeed, if we assume that `target_q_values = 1` at the goal, then with `terminals = 0` (and a discount close to 1) we get `q_target = 2`, at the next iteration `q_target = 3`, and so on. If `terminals = 1`, i.e. if the state is terminal according to the resampled goal, then `q_target = 1`.
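A tiny numeric illustration of this fixed-point argument (my own sketch, assuming a discount of 0.99):

```python
# q <- reward + (1 - terminal) * discount * q, starting from q = 1 at the goal
discount = 0.99
for terminal in (0.0, 1.0):
    q = 1.0
    for _ in range(1000):
        q = 1.0 + (1.0 - terminal) * discount * q
    print(f"terminal={terminal}: q ~ {q:.1f}")
# terminal=0.0 drifts towards 1 / (1 - discount) = 100 (and diverges as the
# discount approaches 1); terminal=1.0 stays fixed at the intended value 1.
```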
So in my fork of your repository, I replaced the terminals kept in the relabelled batch by terminals recomputed from the resampled goals, where terminals is 1 if `distance(state, goal) < epsilon` in my case. This fixed the Q-function blow-up issue.