
Token-level rewards #1302

Closed · wants to merge 5 commits

Conversation

XanderJC

Allow passing rewards to step() as a List[torch.FloatTensor] where each tensor is either of shape 1 (i.e. the same as the existing functionality) or of shape (response_length,), so that we can give rewards to individual tokens. Important for more fine-grained feedback.
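
For illustration, a rough sketch of the intended usage (the `ppo_trainer`, `queries`, and `responses` objects are placeholders for an existing PPO setup):

```python
import torch

# End-of-sequence rewards, as before: one scalar per response.
scalar_rewards = [torch.tensor([1.0]), torch.tensor([0.5])]

# Token-level rewards: one value per token of each response.
token_rewards = [
    torch.tensor([0.0, 0.0, 0.0, 1.0]),       # response of length 4
    torch.tensor([0.0, 0.2, 0.0, 0.0, 0.3]),  # response of length 5
]

# Either form would be accepted by the proposed step() API:
# stats = ppo_trainer.step(queries, responses, scalar_rewards)
# stats = ppo_trainer.step(queries, responses, token_rewards)
```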

trl/trainer/ppo_trainer.py (outdated)

@younesbelkada left a comment


cc @vwxyzjn @lvwerra what do you think?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@vwxyzjn left a comment


Thanks for the PR! This PR could be useful for research like https://openai.com/research/improving-mathematical-reasoning-with-process-supervision :) Did a quick pass and left a few comments. The two main high-level things are:

  1. keep backward compatibility of the API
  2. maybe clean up the logic a bit around the if/else statements

Comment on lines +608 to +633
```python
# format scores to token-level scores if needed
for i, (score, response) in enumerate(zip(scores, responses)):
    # make score 1-dimensional
    if score.dim() > 2:
        raise ValueError(f"Scores must be 1-dimensional - got {score.dim()} for {score}")
    elif score.dim() == 1:
        scores[i] = score.squeeze()
    elif score.dim() == 2:
        if score.shape[0] != 1 or score.shape[1] != 1:
            raise ValueError(f"Scores must be 1-dimensional - got {score.shape} for {score}")
        else:
            score = score.squeeze(1)
    elif score.dim() == 0:
        score = score.unsqueeze(0)
    # make score token-level
    if score.shape[0] != 1:
        if score.shape[0] != response.shape[0]:
            raise ValueError(
                f"Score and response must have the same length if score not scalar- got {score.shape[0]} and {response.shape[0]}"
            )
        else:
            scores[i] = score
            token_level_score = True
    else:
        token_score = torch.zeros_like(response, dtype=float).squeeze().to(self.current_device)
        token_score[-1] = score
        scores[i] = token_score
        token_level_score = False
```
Contributor

This seems like way too many if-else statements. The end-of-sequence reward is a special case of token-level rewards. Maybe we could just always convert end-of-sequence reward to token-level rewards internally?
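
A minimal sketch of that idea, assuming the internal representation is always a per-token reward vector (the helper name here is illustrative, not part of the PR):

```python
import torch

def to_token_level(score: torch.Tensor, response: torch.Tensor) -> torch.Tensor:
    """Normalize a reward to one value per response token.

    A scalar (end-of-sequence) reward is placed on the final token with zeros
    elsewhere; a token-level reward is validated and passed through.
    """
    score = score.reshape(-1)  # accept 0-d, (1,), or (1, n) inputs uniformly
    if score.numel() == 1:
        token_score = torch.zeros_like(response, dtype=score.dtype)
        token_score[-1] = score[0]
        return token_score
    if score.numel() != response.numel():
        raise ValueError(f"Expected {response.numel()} rewards, got {score.numel()}")
    return score
```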

Author

@vwxyzjn happy to change things, but let me first explain what the code is doing in a bit more detail to see if it makes sense. It is doing two things, which maybe makes the number of if/else statements a bit more palatable:

  1. Lines 611-619 do most of the safety checking: to keep things backward-compatible we need to accept scores that are either scalar, e.g. [1.0], or a vector, e.g. [1.0, 2.0, 3.0]. I think this means we need more dim checks, since in the scalar case the user might pass either torch.tensor(1.0) or torch.tensor([1.0]) (dim 0 or 1), and in the vector case either torch.tensor([1.0, 2.0, 3.0]) or torch.tensor([[1.0, 2.0, 3.0]]) (dim 1 or 2); see the shape sketch below. Here we just make sure everything ends up as dim 1, i.e. torch.tensor([1.0]) or torch.tensor([1.0, 2.0, 3.0]). Theoretically we could simplify the code by dropping the conversion of non-dim-1 tensors and throwing an error instead, but converting seems helpful - let me know if you think erroring is the better strategy.
  2. Lines 621-633 do what you suggest, converting end-of-sequence rewards to token-level rewards: we check the length, and if it's 1 we turn it into a token-level reward with the score in the last position; otherwise we check that the token-level reward is the same size as the response, and if it is, everything's good.

I added the token_level_score flag as well to inform later manipulations of the score, which I think should depend on which form got passed to trainer.step().
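
For concreteness, a quick sketch of the shape cases described in point 1 above (purely illustrative):

```python
import torch

# Scalar (end-of-sequence) rewards a user might pass:
torch.tensor(1.0)    # dim 0
torch.tensor([1.0])  # dim 1

# Token-level rewards a user might pass:
torch.tensor([1.0, 2.0, 3.0])    # dim 1
torch.tensor([[1.0, 2.0, 3.0]])  # dim 2

# The checks aim to normalize all of these to dim-1 tensors, after which a
# length-1 tensor is treated as an end-of-sequence reward and anything longer
# as a token-level reward that must match the response length.
```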

Contributor

Why do we need a token_level_score? Maybe we can just assume score is token-level. Also, token_level_score does not appear to be tracked per index, so if token_level_score=True at i and token_level_score=False at i+1, isn't that information lost?

Comment on lines +577 to +580
```python
dummy_scores = [
    torch.tensor([0, 0, 0, 0, 1], device=ppo_trainer.current_device),
    torch.tensor([0, 0, 0, 0, 0, 2], device=ppo_trainer.current_device),
]
```
Contributor

Is this a breaking change requiring users to always pass in token level rewards?

Author

@vwxyzjn I don't think this is a breaking change, but only because I'm assuming that no one is calling self.compute_rewards() outside of self.step(), as I couldn't see why that would be useful.

I set it up so you can pass either token-level or end-of-sequence rewards to step(), which then all get converted to token-level in _step_safety_checker(), meaning everything that gets sent to compute_rewards() is always token-level.

Comment on lines 857 to 860
```python
total_scores = torch.tensor([score[score != padding_value].sum() for score in padded_scores]).to(
    self.current_device
)
```
Contributor

We should not only sum it up — the discount factor gamma also comes in. The total_scores should be the discounted episodic return. See https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

[image: the discounted return formula, R(τ) = Σ_t γ^t r_t, from the Spinning Up RL introduction]
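
As a sketch of what that could look like for per-token rewards and a discount factor gamma (a minimal illustration, not the PR's code):

```python
import torch

def discounted_return(token_rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    """R(tau) = sum_t gamma^t * r_t for one response's per-token rewards."""
    discounts = gamma ** torch.arange(token_rewards.numel(), dtype=token_rewards.dtype)
    return (discounts * token_rewards).sum()

# e.g. discounted_return(torch.tensor([0.0, 0.0, 1.0]), gamma=0.99) -> 0.9801
```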

Author

@vwxyzjn ah yes good point, I've added that calculation

@XanderJC XanderJC requested a review from vwxyzjn February 19, 2024 11:28

@vwxyzjn left a comment


Some further comments

```python
if score.dim() > 1:
    # format scores to token-level scores if needed
    for i, (score, response) in enumerate(zip(scores, responses)):
        # make score 1-dimensional
```
Contributor

This comment seems outdated.


Comment on lines +1376 to +1377
```python
# sum to episodic rewards
rewards = [reward.sum() for reward in rewards]
```
Contributor

This is not the discounted return


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions bot closed this Mar 23, 2024