Token-level rewards #1302
Conversation
Thanks for the PR! This PR could be useful for research like https://openai.com/research/improving-mathematical-reasoning-with-process-supervision :) Did a quick pass and left a few comments. The two main high-level things are:
- keep backward compatibility of the API
- maybe clean up the logic a bit regarding the if / else statements
# format scores to token-level scores if needed
for i, (score, response) in enumerate(zip(scores, responses)):
    # make score 1-dimensional
    if score.dim() > 2:
        raise ValueError(f"Scores must be 1-dimensional - got {score.dim()} for {score}")
    elif score.dim() == 1:
        scores[i] = score.squeeze()
    elif score.dim() == 2:
        if score.shape[0] != 1 or score.shape[1] != 1:
            raise ValueError(f"Scores must be 1-dimensional - got {score.shape} for {score}")
        else:
            score = score.squeeze(1)
    elif score.dim() == 0:
        score = score.unsqueeze(0)
    # make score token-level
    if score.shape[0] != 1:
        if score.shape[0] != response.shape[0]:
            raise ValueError(
                f"Score and response must have the same length if score not scalar- got {score.shape[0]} and {response.shape[0]}"
            )
        else:
            scores[i] = score
            token_level_score = True
    else:
        token_score = torch.zeros_like(response, dtype=float).squeeze().to(self.current_device)
        token_score[-1] = score
        scores[i] = token_score
        token_level_score = False
This seems like way too many if-else statements. The end-of-sequence reward is a special case of token-level rewards. Maybe we could just always convert end-of-sequence reward to token-level rewards internally?
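For illustration, a minimal sketch of that idea, assuming `score` is a per-response reward tensor and `response` is the 1-D tensor of response token ids from the snippet above (the helper name is hypothetical, not part of this PR):

```python
import torch

def to_token_level(score: torch.Tensor, response: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: expand a scalar end-of-sequence reward into a
    # per-token reward vector; pass token-level rewards through unchanged.
    score = score.squeeze()
    if score.dim() == 0:
        token_score = torch.zeros_like(response, dtype=torch.float)
        token_score[-1] = score  # all reward placed on the final token
        return token_score
    return score
```

With something like this applied up front, the rest of the pipeline would only ever see token-level rewards, so no extra flag would be needed.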
@vwxyzjn happy to change things, but let me quickly explain what I'm doing here in a bit more detail first to see if it makes sense, as the block is doing two things, which maybe makes the number of if/else statements a bit more palatable:
- lines 611-619 do most of the safety checking: to keep this backward-compatible we need to accept scores that are either scalar, e.g. [1.0], or a vector, e.g. [1.0, 2.0, 3.0]. I think this means we need quite a few dim checks, because in the first case the user could pass either torch.tensor(1.0) or torch.tensor([1.0]) (dim 0 and 1 respectively), and in the second either torch.tensor([1.0, 2.0, 3.0]) or torch.tensor([[1.0, 2.0, 3.0]]) (dim 1 and 2 respectively). So here we're just making sure everything ends up as dim 1, i.e. torch.tensor([1.0]) or torch.tensor([1.0, 2.0, 3.0]) - see the sketch after this list. We could theoretically simplify the code by dropping the logic that converts non-dim-1 tensors and just throwing an error, but converting seems more helpful - let me know if you think erroring is the better strategy.
- lines 621-633 do what you suggest, converting end-of-sequence rewards to token-level rewards: we check the length, and if it's 1 we turn it into a token-level reward with the score in the last position; otherwise we check that the token-level reward is the same size as the response, and if it is then everything's good.
I also added the token_level_score flag to inform later manipulations of the score, which I think should depend on which kind got passed to trainer.step().
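As a rough sketch of that first step (illustrative only, not the exact code in this PR), the four accepted inputs could be collapsed to dim 1 like this:

```python
import torch

def ensure_1d(score: torch.Tensor) -> torch.Tensor:
    # Sketch: accept torch.tensor(1.0), torch.tensor([1.0]),
    # torch.tensor([1.0, 2.0, 3.0]) or torch.tensor([[1.0, 2.0, 3.0]])
    # and always return a 1-D tensor, erroring on anything else.
    score = score.squeeze()       # [[1., 2., 3.]] -> [1., 2., 3.]; [1.] -> scalar
    if score.dim() == 0:          # torch.tensor(1.0) or torch.tensor([1.0])
        score = score.unsqueeze(0)
    if score.dim() != 1:
        raise ValueError(f"Scores must be 1-dimensional - got {score.dim()} for {score}")
    return score
```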
Why do we need a token_level_score? Maybe we can just assume score is token-level. Also, token_level_score doesn't appear to be tracked for each index, so if token_level_score=True at i and token_level_score=False at i+1, isn't that information lost?
dummy_scores = [
    torch.tensor([0, 0, 0, 0, 1], device=ppo_trainer.current_device),
    torch.tensor([0, 0, 0, 0, 0, 2], device=ppo_trainer.current_device),
]
Is this a breaking change requiring users to always pass in token level rewards?
@vwxyzjn I don't think this is a breaking change, but only because I'm assuming that no one is calling self.compute_rewards() outside of self.step(), as I couldn't see why that would be useful.
I set it up so you can pass either token-level or end-of-sequence rewards to step(), which then all get converted to token-level in _step_safety_checker(), meaning everything that gets sent to compute_rewards() would be always token-level.
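As a usage sketch (the query/response tensors and reward values below are made up for illustration, and the token-level form is the behaviour this PR proposes):

```python
# query_tensors / response_tensors: lists of token-id tensors prepared earlier;
# response lengths here are assumed to be 3 and 4.

# End-of-sequence rewards, one scalar per response (existing behaviour)
ppo_trainer.step(query_tensors, response_tensors, [torch.tensor(1.0), torch.tensor(0.5)])

# Token-level rewards, one value per response token (proposed in this PR)
ppo_trainer.step(
    query_tensors,
    response_tensors,
    [torch.tensor([0.0, 0.0, 1.0]), torch.tensor([0.0, 0.5, 0.0, 0.5])],
)
```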
trl/trainer/ppo_trainer.py (Outdated)
total_scores = torch.tensor([score[score != padding_value].sum() for score in padded_scores]).to(
    self.current_device
)
We should not only sum it up — the discount factor gamma also comes in. The total_scores should be the discounted episodic return. See https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
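A minimal sketch of that calculation (the gamma value and function name here are just for illustration):

```python
import torch

def discounted_return(rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    # Discounted episodic return: sum_t gamma^t * r_t over one reward sequence.
    discounts = gamma ** torch.arange(rewards.shape[0], dtype=rewards.dtype, device=rewards.device)
    return (discounts * rewards).sum()

# gamma = 1.0 recovers the plain sum in the snippet above
print(discounted_return(torch.tensor([0.0, 0.0, 1.0]), gamma=0.99))  # 0.99**2 * 1.0 = 0.9801
```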
@vwxyzjn ah yes good point, I've added that calculation
Some further comments
if score.dim() > 1:
# format scores to token-level scores if needed
for i, (score, response) in enumerate(zip(scores, responses)):
    # make score 1-dimensional
This comment seems outdated.
# sum to episodic rewards
rewards = [reward.sum() for reward in rewards]
This is not the discounted return
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Allow passing rewards to step that are List[torch.FloatTensor], each either of shape 1 (i.e. same as the existing functionality) or of shape (response_length,) so that we can give rewards to individual tokens. Important for more fine-grained feedback.