[Proposal/Question] Incorrect documentation of NormalizeReward wrapper #1272

Open
keraJLi opened this issue Dec 13, 2024 · 3 comments · May be fixed by #1285
Labels
enhancement New feature or request

Comments

keraJLi commented Dec 13, 2024

Proposal

The documentation for NormalizeReward (here, in code) is partially incorrect and unclear. It states:

> This wrapper will scale rewards s.t. the discounted returns have a mean of 0 and std of 1.

  1. The wrapper does not normalize discounted returns to have a mean of 0. Instead, the rewards are merely divided by the running standard deviation of a specific term (see here). The description should be changed accordingly.

  2. I am also concerned about the claim that the discounted returns have a variance of 1. The term we are computing the running variance of is $\sum_{t=0}^T \gamma^{T-t} r_t$, where $T$ is an episode's current time step. I would interpret this as "backwards-discounting". Importantly, this is different from the discounted return ($\sum_{t=0}^\infty \gamma^t r_t$), or even what I would call the discounted sum of previous rewards up until timestep $T$ ($\sum_{t=0}^T \gamma^t r_t$). I assume the latter is what the description means by "discounted returns", which should be clarified if that is the case.
    To me, it is also unclear how dividing by this term leads to a unit variance of either the discounted return or the discounted return up to $T$. In fact, I was not able to empirically confirm that it does. In contrast, I was able to confirm that $\text{Var}\left[\sum_{t=0}^T \gamma^{T-t} r_t\right] = 1$ for the scaled rewards (see the sketch below this list).
    The reference in the wrapper's description does not mention any theoretical properties of this discounting scheme.

  3. The description also says "The exponential moving average will have variance $(1 - \gamma)^2$". I do not understand what exactly this means (and shouldn't it depend on the reward?), and would be happy about any explanation or reference that explains this.
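
For concreteness, here is a minimal numerical sketch of the scaling as I read the code, together with the empirical check mentioned in point 2. The `RunningVariance` helper and the iid reward stream are my own illustrative choices, not Gymnasium's API, and the sketch ignores episode boundaries, which the actual wrapper handles:

```python
import numpy as np

# Sketch of the scaling described above: each reward is divided by the running
# standard deviation of the "backwards-discounted" accumulator
#   G_T = gamma * G_{T-1} + r_T = sum_{t<=T} gamma^(T-t) r_t.


class RunningVariance:
    """Welford-style running mean/variance of a scalar stream."""

    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        return self.m2 / self.count if self.count > 1 else 1.0


rng = np.random.default_rng(0)
gamma, eps, n_steps = 0.99, 1e-8, 400_000

rms, accumulator, scaled = RunningVariance(), 0.0, []
for _ in range(n_steps):
    r = rng.normal(scale=2.0)                  # iid rewards, purely for illustration
    accumulator = gamma * accumulator + r      # backwards-discounted term from point 2
    rms.update(accumulator)
    scaled.append(r / np.sqrt(rms.var + eps))  # the reward the wrapper would return

# Empirical check: the backwards-discounted sum of the *scaled* rewards has
# variance close to 1 (discarding a burn-in while the running std converges).
# This is not the same statement as the discounted return sum_t gamma^t r_t
# having variance 1.
G, backward = 0.0, []
for s in scaled:
    G = gamma * G + s
    backward.append(G)
print(np.var(backward[n_steps // 2:]))  # ~ 1
```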

I would be happy about any clarifications concerning points 2 and 3. I also suggest adding the relevant references to the documentation.

Motivation

I am using reward scaling myself, and find Gymnasium to be an important reference implementation. Clear documentation is important for understanding the methods that Gymnasium implements.

Pitch

  1. Change the description of the NormalizeReward wrapper to remove incorrect information.
  2. Clarify the description (or answer my questions :) ) by adding relevant references.

Alternatives

No response

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
@pseudo-rnd-thoughts
Member

@keraJLi Thanks for the issue. NormalizeReward is the wrapper that confuses me the most, so I'm not surprised if the description is wrong.
I think we can't change the implementation, so if you can update the documentation that would be amazing.

For point 3, this seems to have originated in openai/gym#2784 with no justification for it sadly, so I won't be surprised if it is wrong.

Small questions

  1. What does the mean discounted reward tend towards? Is there a fixed value?

keraJLi commented Dec 17, 2024

> For point 3, this seems to have originated in openai/gym#2784 with no justification for it sadly, so I won't be surprised if it is wrong.

I thought about this again and believe this is what was meant: If the rewards are iid (which they are not in general), $\text{Var}\left[\sum_{t=0}^T \gamma^t r_{T-t}\right] = \sum_{t=0}^T\gamma^{2t}\text{Var}[r_{T-t}]$, and since $\sum_t \gamma^{2t}$ is a geometric series, in the limit this equals $\frac{1}{1-\gamma^2} \text{Var}[r]$ (written out below). I believe that the corresponding term for rewards that are identically distributed and stationary but not independent should look similar, but would additionally depend on the autocorrelation of the reward.
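
Writing out the limit step (under the same iid assumption, with common variance $\text{Var}[r]$):

$$
\text{Var}\left[\sum_{t=0}^T \gamma^t r_{T-t}\right]
= \sum_{t=0}^T \gamma^{2t}\,\text{Var}[r]
= \frac{1-\gamma^{2(T+1)}}{1-\gamma^2}\,\text{Var}[r]
\;\xrightarrow{T\to\infty}\;
\frac{\text{Var}[r]}{1-\gamma^2}.
$$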

> What does the mean discounted reward tend towards? Is there a fixed value?

I do not believe there is a fixed value, since the normalization factor is not directly related to the episodic return. After some more digging, I found that other people seem to have investigated the same question (and came to the same conclusion): openai/gym#2387 (comment)

> if you can update the documentation that would be amazing

I will create a pull request in the coming days 🤗

keraJLi commented Dec 18, 2024

Small update: After some calculations, I have confirmed that $(1-\gamma)^2$ is indeed a bound for the variance of the scaled rewards. I have not yet found any paper that mentions this and would propose leaving it out of the description unless I find one. Another possibility could be that I post my derivation in this thread and link it.
Let me know if you have any preference.

keraJLi linked pull request #1285 on Jan 2, 2025 that will close this issue