policy_old appears to serve no purpose at all #65
Comments
The current data flow is:
This flow of yours is completely equivalent to:
In this flow, policy_old plays no role at all; in other words, if you remove policy_old from the code and use policy in its place, the final result is exactly the same.
So is this really PPO??
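A minimal sketch of the behaviour being described, assuming a layout in the style of common PPO-PyTorch implementations (the make_actor helper and the sizes below are illustrative assumptions, not this repository's code): log-probs are recorded at action-selection time, and policy_old is overwritten with policy's weights right after each update, so at collection time the two networks are indistinguishable.

```python
import torch
import torch.nn as nn

# Hypothetical minimal actor; the name and layer sizes are illustrative only.
def make_actor(state_dim=4, action_dim=2):
    return nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                         nn.Linear(64, action_dim), nn.Softmax(dim=-1))

policy = make_actor()
policy_old = make_actor()
policy_old.load_state_dict(policy.state_dict())  # synced right after every update

state = torch.rand(4)
with torch.no_grad():
    # At collection time the two networks hold identical weights, so the stored
    # "old" action probabilities (and log-probs) come out the same whichever
    # network computes them, which is exactly the behaviour described here.
    assert torch.allclose(policy_old(state), policy(state))

# During the K update epochs, however, `policy` moves while the stored log-probs
# stay fixed; those detached stored values, rather than the policy_old network
# itself, are what act as the old distribution in the ratio.
```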
You are right, but the author also has a reason to do it this way. I think the current process is not redundant:
Thanks for the reply. I agree with your point: the old policy can indeed be used to compute a KL divergence, so that this round's update does not become too large. That said, when I found out that PPO takes a long detour only to tell me that the "other distribution" is just the previous one, I honestly wanted to flip the table. The whole of PPO reads like an engineering attempt made after noticing that the for-loop iteration was missing a damping term, yet the resulting paper is written so obscurely.
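As a concrete illustration of the KL point in this exchange, here is a minimal, self-contained sketch (the approx_kl helper, the dummy numbers, and the target_kl threshold are all illustrative assumptions, not code from this repository): the log-probs stored under the old policy are compared with the current policy's log-probs, and the update round is cut short once the estimated divergence grows too large.

```python
import torch

def approx_kl(logprobs_old: torch.Tensor, logprobs_new: torch.Tensor) -> torch.Tensor:
    # Sample-based estimate of KL(pi_old || pi_new) using the log-probs of the
    # actions that were actually taken; enough for an early-stopping check.
    return (logprobs_old - logprobs_new).mean()

# Dummy log-probs: the first set was stored when acting under the old policy,
# the second is recomputed by the current policy partway through the update epochs.
logprobs_old = torch.log(torch.tensor([0.50, 0.40, 0.60]))
logprobs_new = torch.log(torch.tensor([0.45, 0.35, 0.70]))

target_kl = 0.01  # illustrative threshold, not a value taken from this repository
if approx_kl(logprobs_old, logprobs_new) > 1.5 * target_kl:
    print("new policy drifted too far from the old one: stop this round of updates")
```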
Although the motivation behind PPO might not be that simple, I would guess part of it came from the experience of applying TRPO in engineering practice. Actually, introducing KL divergence into PPO for early stopping is just a trick; the authors of the PPO paper did not intend that. In my opinion, PPO is motivated by TRPO's computational complexity: instead of computing the KL divergence (which is slow) as TRPO does, PPO (the clip version) simply limits the policy update with a clip() function. You can dig deeper by reading the OpenAI Spinning Up page on PPO, which I have cited below :)
— from OpenAI Spinning Up for PPO
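For comparison, here is a self-contained sketch of the clipped surrogate objective described above (the example tensors and the eps_clip value are illustrative, not taken from this repository): the ratio between new and old action probabilities is clamped, which bounds the effective step without ever computing a KL divergence.

```python
import torch

def ppo_clip_loss(logprobs_new, logprobs_old, advantages, eps_clip=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
    ratios = torch.exp(logprobs_new - logprobs_old.detach())
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1.0 - eps_clip, 1.0 + eps_clip) * advantages
    # Maximising the clipped surrogate objective equals minimising its negation.
    return -torch.min(surr1, surr2).mean()

# Illustrative tensors standing in for a rollout batch.
logprobs_old = torch.log(torch.tensor([0.50, 0.40, 0.60]))
logprobs_new = torch.log(torch.tensor([0.90, 0.10, 0.65]))  # pretend the policy moved
advantages   = torch.tensor([1.0, -0.5, 0.3])

print(ppo_clip_loss(logprobs_new, logprobs_old, advantages).item())
```

Once the ratio leaves [1 - eps_clip, 1 + eps_clip], the min() with the clipped term removes the incentive to push the policy further in that direction, which is what stands in for TRPO's explicit trust-region constraint.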
Yes, exactly. Starting from TRPO, PPO's improvement is a success. Thanks again for your answer; I will take a look at the OpenAI version later.
You are welcome :)