Question about the TRPO or PPO code #96
old_log_probs is computed and then log_probs is computed immediately afterwards, with no policy-gradient parameter update in between, so why are they not equal?
Comments
If you use a test like == they are indeed not equal, because the former has been detached while the latter has not. You can do what this method in the code does and check whether the ratio of the two is 1.
They are equal the first time around; afterwards log_probs gets updated by the gradient steps while old_log_probs stays fixed, which guarantees the reference (old) policy does not change during the several epochs of training.
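For reference, here is a minimal sketch of this pattern (names such as `actor`, `states`, `actions`, `advantage`, and `optimizer` are illustrative placeholders, not necessarily the repository's exact code): old_log_probs is detached once before the inner epochs, and the ratio can be checked in the first epoch.

```python
import torch

# Hypothetical PPO inner loop; `actor`, `states`, `actions`, `advantage`
# and `optimizer` are assumed to exist and are illustrative only.

# Computed once, before the K epochs; detach() freezes it as a constant.
old_log_probs = torch.log(actor(states).gather(1, actions)).detach()

for epoch in range(10):
    log_probs = torch.log(actor(states).gather(1, actions))
    ratio = torch.exp(log_probs - old_log_probs)

    # In the very first epoch the parameters have not changed yet,
    # so the ratio is numerically all ones.
    if epoch == 0:
        print(torch.allclose(ratio, torch.ones_like(ratio)))  # True

    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - 0.2, 1 + 0.2) * advantage
    loss = -torch.mean(torch.min(surr1, surr2))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # from the next epoch on, log_probs drifts away from old_log_probs
```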
Running the code, they are indeed equal the first time, with ratio = 1, but the actor network's gradient is not 0, so log_probs gets updated afterwards and the whole loop gets going. The question is: when ratio = 1, the objective seems to depend only on the advantage A and not on the actor's parameters, so the actor's gradient should be 0. Why is the gradient nonzero after running?
搞清楚了, 第一次更新时,log_prob和old_log_prob相等,但梯度不为零,原因在于log_prob和old_log_prob只是数值相等,不代表跟动作输出概率无关,也就不代表跟策略参数无关。事实上,由于old_log_prob的detach()操作,old_log_prob可以被看成与变量无关的常数,而log_prob才是真正的变量。首次更新时,只能说log_prob变量的值正好跟old_log_prob相等,不代表函数跟变量log_prob无关。 |
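A tiny self-contained example (a toy one-parameter policy, not code from the repository) that shows the gradient is nonzero at the first update even though the ratio is exactly 1:

```python
import torch

# Toy "policy": the probability of the taken action is sigmoid(theta).
theta = torch.tensor(0.5, requires_grad=True)
advantage = torch.tensor(2.0)  # arbitrary nonzero advantage

log_prob = torch.log(torch.sigmoid(theta))
old_log_prob = log_prob.detach()   # numerically equal, but a constant

ratio = torch.exp(log_prob - old_log_prob)
print(ratio.item())                # 1.0

loss = -(ratio * advantage)
loss.backward()

# d(loss)/d(theta) = -A * d(ratio)/d(theta)
#                  = -A * ratio * d(log_prob)/d(theta), which is nonzero
print(theta.grad.item())           # about -0.755, not 0
```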