Standardization of the cost advantages does not divide by the standard deviation #356

lukasfro · 2024-10-22T11:15:46Z


Hi all,

I'm currently going through the details of the P3O implementation. I was looking for where the reward and cost advantages are standardized and found that this happens in the OnPolicyBuffer class and can be enabled via the standardized_adv_r and standardized_adv_c flags.

However, looking at the implementation, I am confused about why the cost advantages are only mean-centered and not also divided by the standard deviation, i.e., here. I would have expected something like the following change:

- cadv_mean, *_ = distributed.dist_statistics_scalar(data['adv_c'])
+ cadv_mean, cadv_std, *_ = distributed.dist_statistics_scalar(data['adv_c'])
  if self._standardized_adv_c:
-     data['adv_c'] = data['adv_c'] - cadv_mean
+     data['adv_c'] = (data['adv_c'] - cadv_mean) / (cadv_std + 1e-8)
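
To make the difference concrete, here is a minimal standalone sketch in plain Python. It is not OmniSafe code: the helper names `center` and `standardize` and the sample values are made up for illustration, and I use the population standard deviation, which may or may not match the estimator that `dist_statistics_scalar` uses.

```python
import statistics


def center(adv):
    """Subtract the mean only (how the cost advantages appear to be handled today)."""
    mean = statistics.fmean(adv)
    return [a - mean for a in adv]


def standardize(adv, eps=1e-8):
    """Subtract the mean and divide by the standard deviation (the proposed change)."""
    mean = statistics.fmean(adv)
    std = statistics.pstdev(adv)  # population std; an assumption, see note above
    return [(a - mean) / (std + eps) for a in adv]


# Hypothetical cost advantages on an arbitrary scale.
adv_c = [0.0, 10.0, 20.0, 30.0]

centered = center(adv_c)
standardized = standardize(adv_c)

# Centering leaves the spread untouched; standardizing rescales it to roughly 1.
print("std after centering:    ", statistics.pstdev(centered))
print("std after standardizing:", statistics.pstdev(standardized))
```

With centering alone, the magnitude of the cost advantages still depends on the scale of the cost signal, whereas the standardized version has approximately unit spread regardless of that scale.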

Happy to open a PR if this is not by design.
