[Feature] Linearise reward transform #2681
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2681
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks for this!
Just a few suggestions
Args:
    in_keys: The keys under which the multi-objective rewards are found.
    out_keys: The keys under which single-objective rewards should be written. Defaults to `in_keys`.
    weights: Dictates how to weight each reward when summing them. Defaults to `[1.0, 1.0, ...]`.
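(For context only: a hedged sketch of how a `weights` argument with this default might be normalised. The helper name and checks are illustrative, not the PR's actual code.)

```python
import torch


def _normalise_weights(weights, num_rewards):
    # Hypothetical helper: None -> equal weighting [1.0, 1.0, ...];
    # a list of floats or a Tensor -> a float tensor with one entry
    # per reward component.
    if weights is None:
        return torch.ones(num_rewards)
    weights = torch.as_tensor(weights, dtype=torch.get_default_dtype())
    if weights.numel() != num_rewards:
        raise ValueError(f"expected {num_rewards} weights, got {weights.numel()}")
    return weights.reshape(num_rewards)
```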
What's the typing? Is a tensor appropriate? What about a differentiable tensor?
I added the typing, `List[float] | Tensor | None`. I didn't give any thought to making the tensor differentiable -- I believe it would give unexpected results here, since `weights` is registered as a non-parameter buffer.
You can register anything as a buffer; it just won't be listed in `parameters()`.
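(A minimal sketch of that distinction, with a hypothetical module name; buffers travel with the module but are invisible to optimizers built from `parameters()`.)

```python
import torch
from torch import nn


class WeightedSum(nn.Module):
    # Hypothetical module, for illustration only.
    def __init__(self, weights: torch.Tensor):
        super().__init__()
        # A buffer is saved in state_dict and moved with .to(device),
        # but it is not returned by .parameters(), so an optimizer built
        # from module.parameters() will never update it.
        self.register_buffer("weights", weights)

    def forward(self, reward: torch.Tensor) -> torch.Tensor:
        return (self.weights * reward).sum(-1, keepdim=True)


m = WeightedSum(torch.tensor([1.0, 0.5]))
print(list(m.parameters()))         # [] -- the buffer is not a parameter
print(list(m.state_dict().keys()))  # ['weights']
```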
Yes -- my understanding is that it would therefore be ignored by any optimizer. What I mean by "it would give unexpected results" is that if you pass a differentiable `weights` tensor to this module in the hope of optimising it somehow, you will never get any gradients for it. Maybe I'm misunderstanding your comment?
No no, my comment was just this: if your weight is differentiable, is the reward going to be differentiable too? I think it will, so that's great (I can imagine some researcher, somewhere, trying to hack something like that).
Got it -- the answer is that, as is, no.
I also see the value, but maybe we can keep this for another PR?
MR suggestions
Co-authored-by: Vincent Moens <vincentmoens@gmail.com>
LGTM thanks!
From the failing tests it seems a
Description
Adds a transform to linearise multi-objective rewards into a single one, via a weighted sum. Essentially, this is a reproduction of mo-gym's LinearReward wrapper.
Motivation and Context
This feature provides a naive way to address multi-objective environments.
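(A rough illustration of the idea in plain torch/tensordict, not the transform class added by this PR: the multi-objective reward vector is collapsed into a scalar via a weighted sum, written back under the out key, which defaults to the in key.)

```python
import torch
from tensordict import TensorDict

weights = torch.tensor([1.0, 0.5, 2.0])          # one weight per objective

td = TensorDict(
    {"reward": torch.tensor([0.3, -1.0, 0.2])},  # multi-objective reward
    batch_size=[],
)

# Weighted sum over the last (reward) dimension; with out_keys defaulting
# to in_keys, the scalarised reward overwrites the original entry.
td["reward"] = (weights * td["reward"]).sum(-1, keepdim=True)
print(td["reward"])  # tensor([0.2000])
```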
Types of changes
What types of changes does your code introduce? Remove all that do not apply:
Checklist
Go over all the following points, and put an `x` in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!