Paper and implementation are different. #6
Comments
We are normalizing the reward here: large-scale-curiosity/cppo_agent.py Line 140 in 0c3d179
The normalization is by running std of a sum of discounted rewards here: large-scale-curiosity/cppo_agent.py Lines 226 to 236 in 0c3d179
One caveat is that for convenience we do the discounting backwards in time rather than forwards (it's convenient because at any moment the past is fully available and the future is yet to come).
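For readers following along, here is a minimal sketch of the scheme described above, not a copy of the repository code. `RewardForwardFilter`, `rewems`, and `update` are the names used in this thread; the `gamma` parameter, the `normalize_rewards` helper, and the `eps` term are assumptions made for illustration. It also simplifies in two ways: it handles one environment with scalar rewards, and it takes the standard deviation over a single batch rather than keeping a running estimate.

```python
from statistics import pstdev


class RewardForwardFilter:
    """Keeps a discounted sum of the rewards seen so far (discounting the past)."""

    def __init__(self, gamma):
        self.gamma = gamma   # assumed discount factor
        self.rewems = None   # no state until the first update

    def update(self, rew):
        if self.rewems is None:
            # First call only: start the accumulator from the first reward.
            self.rewems = rew
        else:
            # Every later call: decay the accumulated past, then add the new reward.
            self.rewems = self.rewems * self.gamma + rew
        return self.rewems


def normalize_rewards(rewards, gamma=0.99, eps=1e-8):
    """Hypothetical helper: divide raw rewards by the std of the filtered sums."""
    rff = RewardForwardFilter(gamma)
    filtered = [rff.update(r) for r in rewards]
    scale = pstdev(filtered) + eps  # batch std here; the thread describes a running std
    return [r / scale for r in rewards], filtered


normed, filtered = normalize_rewards([1.0, 1.0, 1.0], gamma=0.5)
print(filtered)  # discounted sums of past rewards at each step
```

Note that the raw rewards are what get divided; the discounted sum only determines the scale.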
@yburda Thank you for the reply, but I already knew about that code.
Thank you for pointing this out. We will update the paper (we reported results with a version of code very similar to the published one, so the code is representative).
@yburda Did you mean that you did not use the sum of discounted rewards?
Yes.
@yburda Thank you very much! :)
Upon thinking about it a bit longer - RewardForwardFilter.rewems is None only the first time you call update. Then it assigns something to it: large-scale-curiosity/cppo_agent.py Line 233 in 0c3d179
And for all future calls to update, it's not None anymore. Sorry for the temporary confusion.
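To spell out why this matters, here is a short worked recursion. The notation is an assumption made for illustration: $r_t$ is the reward passed to the $t$-th call of update, $R_t$ is the value of rewems after that call, and $\gamma$ is the discount factor. Assuming each later call decays the stored value by $\gamma$ and adds the new reward, the first call sets the state and every later call extends a discounted sum over the past:

```latex
R_0 = r_0, \qquad R_t = \gamma R_{t-1} + r_t \quad (t \ge 1)
\quad\Longrightarrow\quad
R_t = \sum_{k=0}^{t} \gamma^{\,t-k} r_k .

% Example for t = 2:
R_2 = \gamma^2 r_0 + \gamma r_1 + r_2 .
```

So under that assumption the stored value is a sum of discounted rewards, with the discount applied backwards in time.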
@yburda Oh I was so stupid... Thank you for letting me know.
Sir, I think the code is not right.
In the paper,
but the implementation just uses the reward, not the sum of discounted rewards.
Why is it different?