[Proposal/Question] Incorrect documentation of NormalizeReward wrapper #1272

Open
keraJLi opened this issue Dec 13, 2024 · 3 comments · May be fixed by #1285
Labels
enhancement New feature or request

Comments

keraJLi commented Dec 13, 2024

Proposal

The documentation for NormalizeReward (here, in code) is partially incorrect and unclear. It states:

> This wrapper will scale rewards s.t. the discounted returns have a mean of 0 and std of 1.

  1. The wrapper does not normalize discounted returns to have a mean of 0. Instead, the rewards are merely divided by the running standard deviation of a specific term (see here). The description should be changed accordingly.

  2. I am also concerned about the claim that the discounted returns have a variance of 1. The term we are computing the running variance of is $\sum_{t=0}^T \gamma^{T-t} r_t$, where $T$ is an episode's current time step. I would interpret this as "backwards-discounting". Importantly, this is different from the discounted return ($\sum_{t=0}^\infty \gamma^t r_t$), or even what I would call the discounted sum of previous rewards up until timestep $T$ ($\sum_{t=0}^T \gamma^t r_t$). I assume the latter is what the description means by "discounted returns", which should be clarified if that is the case.
    To me, it is also unclear how dividing by this term leads to a unit variance of either the discounted return or the discounted return up to $T$. In fact, I was not able to empirically confirm that it does. In contrast, I was able to confirm that $\text{Var}\left[\sum_{t=0}^T \gamma^{T-t} r_t\right] = 1$ for the scaled rewards (see the sketch below this list).
    The reference in the wrapper's description does not mention any theoretical properties of this discounting scheme.

  3. The description also says "The exponential moving average will have variance $(1 - \gamma)^2$". I do not understand what exactly this means (and shouldn't it depend on the reward?), and would be happy about any explanation or reference that explains this.
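
For concreteness, here is a minimal numerical sketch of the scaling as I read the code, together with the empirical check mentioned in point 2. The `RunningVariance` helper and the iid reward stream are my own illustrative choices, not Gymnasium's API, and the sketch ignores episode boundaries, which the actual wrapper handles:

```python
import numpy as np

# Sketch of the scaling described above: each reward is divided by the running
# standard deviation of the "backwards-discounted" accumulator
#   G_T = gamma * G_{T-1} + r_T = sum_{t<=T} gamma^(T-t) r_t.


class RunningVariance:
    """Welford-style running mean/variance of a scalar stream."""

    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        return self.m2 / self.count if self.count > 1 else 1.0


rng = np.random.default_rng(0)
gamma, eps, n_steps = 0.99, 1e-8, 400_000

rms, accumulator, scaled = RunningVariance(), 0.0, []
for _ in range(n_steps):
    r = rng.normal(scale=2.0)                  # iid rewards, purely for illustration
    accumulator = gamma * accumulator + r      # backwards-discounted term from point 2
    rms.update(accumulator)
    scaled.append(r / np.sqrt(rms.var + eps))  # the reward the wrapper would return

# Empirical check: the backwards-discounted sum of the *scaled* rewards has
# variance close to 1 (discarding a burn-in while the running std converges).
# This is not the same statement as the discounted return sum_t gamma^t r_t
# having variance 1.
G, backward = 0.0, []
for s in scaled:
    G = gamma * G + s
    backward.append(G)
print(np.var(backward[n_steps // 2:]))  # ~ 1
```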

I would be happy about any clarifications concerning points 2 and 3. I also suggest adding the relevant references to the documentation.

Motivation

I am using reward scaling myself, and find Gymnasium to be an important reference implementation. Clear documentation is important for understanding the methods that Gymnasium implements.

Pitch

  1. Change the description of the NormalizeReward wrapper to remove incorrect information.
  2. Clarify the description (or answer my questions :) ) by adding relevant references.

Alternatives

No response

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
@pseudo-rnd-thoughts
Member

@keraJLi Thanks for the issue. NormalizeReward is the wrapper that confuses me the most, so I'm not surprised if the description is wrong.
I think we can't change the implementation, so if you can update the documentation that would be amazing.

For point 3, this seems to have originated in openai/gym#2784 with no justification for it sadly, so I won't be surprised if it is wrong.

Small questions

  1. What does the mean discounted reward tend towards? Is there a fixed value?

keraJLi commented Dec 17, 2024

> For point 3, this seems to have originated in openai/gym#2784 with no justification for it sadly, so I won't be surprised if it is wrong.

I thought about this again and believe this is what was meant: If the rewards are iid (which they are not in general), $\text{Var}\left[\sum_{t=0}^T \gamma^t r_{T-t}\right] = \sum_{t=0}^T\gamma^{2t}\text{Var}[r_{T-t}]$, and since $\sum_t \gamma^{2t}$ is a geometric series, in the limit this equals $\frac{1}{1-\gamma^2} \text{Var}[r]$ (written out below). I believe that the corresponding term for rewards that are identically distributed and stationary but not independent should look similar, but would additionally depend on the autocorrelation of the reward.
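
Writing out the limit step (under the same iid assumption, with common variance $\text{Var}[r]$):

$$
\text{Var}\left[\sum_{t=0}^T \gamma^t r_{T-t}\right]
= \sum_{t=0}^T \gamma^{2t}\,\text{Var}[r]
= \frac{1-\gamma^{2(T+1)}}{1-\gamma^2}\,\text{Var}[r]
\;\xrightarrow{T\to\infty}\;
\frac{\text{Var}[r]}{1-\gamma^2}.
$$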

> What does the mean discounted reward tend towards? Is there a fixed value?

I do not believe there is a fixed value, since the normalization factor is not directly related to the episodic return. After some more digging, I found that other people seem to have investigated the same question (and came to the same conclusion): openai/gym#2387 (comment)

> if you can update the documentation that would be amazing

I will create a pull request in the coming days 🤗

keraJLi commented Dec 18, 2024

Small update: After some calculations, I have confirmed that $(1-\gamma)^2$ is indeed a bound for the variance of the scaled rewards. I have not yet found any paper that mentions this and would propose leaving it out of the description unless I find one. Another possibility could be that I post my derivation in this thread and link it.
Let me know if you have any preference.

keraJLi linked pull request #1285 on Jan 2, 2025 that will close this issue