Hi all,
I'm currently going through the details of the P3O implementation. I wondered where the standardization of the reward advantages and cost advantages takes place and found that it is computed in the OnPolicyBuffer class and can be activated via the standardize_adv_r/c flags.
However, when looking at the implementation, I am confused why the cost advantages are not divided by the standard deviation, i.e., here.
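To make the question concrete, here is a minimal sketch of the two variants as I understand them. It is not the library's code: the function names `standardize` and `center_only` and the `eps` value are hypothetical, and I am assuming the cost branch only subtracts the mean.

```python
from statistics import fmean, pstdev


def standardize(adv, eps=1e-8):
    """Full standardization: subtract the mean, then divide by the std."""
    mean = fmean(adv)
    std = pstdev(adv)  # population standard deviation
    return [(a - mean) / (std + eps) for a in adv]


def center_only(adv):
    """Mean-centering only: the scale of the advantages is left unchanged."""
    mean = fmean(adv)
    return [a - mean for a in adv]


# Toy cost advantages (made-up numbers, for illustration only).
adv_c = [0.0, 2.0, 4.0, 10.0]

centered = center_only(adv_c)       # zero mean, original spread
standardized = standardize(adv_c)   # zero mean, roughly unit spread
```

With centering only, the cost advantages keep their original scale, so their magnitude relative to the standardized reward advantages depends on the cost signal itself.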
Happy to open a PR if this is not by design.