Discounted Reward Calculation (Generalized Advantage Estimation) #38
Comments
You are completely right. If the episode didn't end, you use the critic network (or the critic head of your two-headed actor-critic network) to approximate V(final_state).
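Written out, the bootstrapped target described above could look like the following. The notation is introduced here for illustration only: rewards r_t, ..., r_{T-1} are collected before the trajectory is cut off at state s_T, \gamma is the discount factor, and V_\phi is the critic's value estimate.

```latex
\hat{R}_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k} + \gamma^{T-t}\, \hat{V}(s_T),
\qquad
\hat{V}(s_T) =
\begin{cases}
0 & \text{if } s_T \text{ is terminal} \\
V_\phi(s_T) & \text{otherwise}
\end{cases}
```

For a finished episode the second term vanishes, which gives the usual Monte Carlo return.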
First, this repository does NOT use Generalized Advantage Estimation; it uses The only time we will get an unfinished trajectory is at the end. So an accurate version would be:
Also, the rewards-to-go calculation introduced in issue #8 seems to be wrong. I am a little busy now and will look into it later. So the correct version, as other implementations use, might be just:
Your adjusted implementation is fine. I use the same semantics for my A2C rollouts, where unfinished episodes are processed by calling critic(last_state). Otherwise, the code just works with finished episodes.
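As a rough illustration of that rollout handling, here is a minimal sketch for a buffer that may contain several episodes, where only the last one can be unfinished. The function name `rewards_to_go` and the parameters `is_terminals` and `last_value` are hypothetical and not taken from the repository; `last_value` stands for the result of calling critic(last_state).

```python
from typing import List, Sequence


def rewards_to_go(
    rewards: Sequence[float],
    is_terminals: Sequence[bool],
    last_value: float = 0.0,
    gamma: float = 0.99,
) -> List[float]:
    """Discounted returns for a rollout buffer that may hold several episodes.

    `last_value` is the critic's estimate for the state reached after the
    final stored step. It is discarded when that step ended an episode.
    """
    returns = [0.0] * len(rewards)
    running = last_value  # bootstrap value for an unfinished final episode
    for t in reversed(range(len(rewards))):
        if is_terminals[t]:
            running = 0.0  # nothing follows a terminal step
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With rewards `[1.0, 1.0, 1.0]`, flags `[False, True, False]`, `last_value=10.0` and `gamma=0.5`, the first episode ends at index 1 and only the final step is bootstrapped from the critic.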
This inaccuracy (maybe I should call it a bug) troubles me a lot! Thanks, @nikhilbarhate99. If the loop exits with
I think we can follow the practice of the latest version of gym: add a variable
I want to ask one more thing about the estimation of the discounted reward. The variable discounted_reward always starts at zero. However, if the episode has not ended, shouldn't it start from the critic network's value estimate instead?
In other words, I think the pseudocode for my suggestion is:
if is_terminal:
    discounted_reward = 0
else:
    discounted_reward = critic_network(final_state)
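The pseudocode above can be turned into a small runnable sketch for a single trajectory. The names `discounted_returns` and `constant_critic` are hypothetical, and the critic here is a stand-in callable rather than the repository's network.

```python
from typing import Callable, List, Sequence


def discounted_returns(
    rewards: Sequence[float],
    final_state: Sequence[float],
    is_terminal: bool,
    critic: Callable[[Sequence[float]], float],
    gamma: float = 0.99,
) -> List[float]:
    # Start from 0 for a finished episode, else from the critic's estimate.
    discounted_reward = 0.0 if is_terminal else critic(final_state)
    returns: List[float] = []
    for reward in reversed(rewards):
        discounted_reward = reward + gamma * discounted_reward
        returns.insert(0, discounted_reward)
    return returns


# Stand-in critic for illustration only: always predicts a value of 4.0.
def constant_critic(state: Sequence[float]) -> float:
    return 4.0
```

Calling `discounted_returns([1.0, 2.0], [0.0], False, constant_critic, gamma=0.5)` seeds the backward pass with the critic's 4.0, while passing `is_terminal=True` seeds it with 0 as in the original code.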