Token-level rewards #1302
Conversation
Thanks for the PR! This PR could be useful for research like https://openai.com/research/improving-mathematical-reasoning-with-process-supervision :) Did a quick pass and left a few comments. The two main high-level things are:
- keep backward compatibility of the API
- maybe clean up the logic a bit regarding the if / else statements
# format scores to token-level scores if needed
for i, (score, response) in enumerate(zip(scores, responses)):
    # make score 1-dimensional
    if score.dim() > 2:
        raise ValueError(f"Scores must be 1-dimensional - got {score.dim()} for {score}")
    elif score.dim() == 1:
        scores[i] = score.squeeze()
    elif score.dim() == 2:
        if score.shape[0] != 1 or score.shape[1] != 1:
            raise ValueError(f"Scores must be 1-dimensional - got {score.shape} for {score}")
        else:
            score = score.squeeze(1)
    elif score.dim() == 0:
        score = score.unsqueeze(0)
    # make score token-level
    if score.shape[0] != 1:
        if score.shape[0] != response.shape[0]:
            raise ValueError(
                f"Score and response must have the same length if score not scalar- got {score.shape[0]} and {response.shape[0]}"
            )
        else:
            scores[i] = score
            token_level_score = True
    else:
        token_score = torch.zeros_like(response, dtype=float).squeeze().to(self.current_device)
        token_score[-1] = score
        scores[i] = token_score
        token_level_score = False
This seems like way too many if-else statements. The end-of-sequence reward is a special case of token-level rewards. Maybe we could just always convert end-of-sequence reward to token-level rewards internally?
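For illustration, a minimal sketch of that idea, assuming `score` is a per-response reward tensor and `response` is the 1-D tensor of response token ids from the snippet above (the helper name is hypothetical, not part of this PR):

```python
import torch

def to_token_level(score: torch.Tensor, response: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: expand a scalar end-of-sequence reward into a
    # per-token reward vector; pass token-level rewards through unchanged.
    score = score.squeeze()
    if score.dim() == 0:
        token_score = torch.zeros_like(response, dtype=torch.float)
        token_score[-1] = score  # all reward placed on the final token
        return token_score
    return score
```

With something like this applied up front, the rest of the pipeline would only ever see token-level rewards, so no extra flag would be needed.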
@vwxyzjn happy to change things, but let me quickly explain what I'm doing here in a bit more detail first to see if it makes sense, as the block is doing two things, which maybe makes the number of if/else statements a bit more palatable:
- lines 611-619 do most of the safety checking: to keep this backward-compatible we need to accept scores that are either scalar, e.g. [1.0], or a vector, e.g. [1.0, 2.0, 3.0]. I think this means we need quite a few dim checks, because in the first case the user could pass either torch.tensor(1.0) or torch.tensor([1.0]) (dim 0 and 1 respectively), and in the second either torch.tensor([1.0, 2.0, 3.0]) or torch.tensor([[1.0, 2.0, 3.0]]) (dim 1 and 2 respectively). So here we're just making sure everything ends up as dim 1, i.e. torch.tensor([1.0]) or torch.tensor([1.0, 2.0, 3.0]) - see the sketch after this list. We could theoretically simplify the code by dropping the logic that converts non-dim-1 tensors and just throwing an error, but converting seems more helpful - let me know if you think erroring is the better strategy.
- lines 621-633 do what you suggest, converting end-of-sequence rewards to token-level rewards: we check the length, and if it's 1 we turn it into a token-level reward with the score in the last position; otherwise we check that the token-level reward is the same size as the response, and if it is then everything's good.
I also added the token_level_score flag to inform later manipulations of the score, which I think should depend on which kind got passed to trainer.step().
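As a rough sketch of that first step (illustrative only, not the exact code in this PR), the four accepted inputs could be collapsed to dim 1 like this:

```python
import torch

def ensure_1d(score: torch.Tensor) -> torch.Tensor:
    # Sketch: accept torch.tensor(1.0), torch.tensor([1.0]),
    # torch.tensor([1.0, 2.0, 3.0]) or torch.tensor([[1.0, 2.0, 3.0]])
    # and always return a 1-D tensor, erroring on anything else.
    score = score.squeeze()       # [[1., 2., 3.]] -> [1., 2., 3.]; [1.] -> scalar
    if score.dim() == 0:          # torch.tensor(1.0) or torch.tensor([1.0])
        score = score.unsqueeze(0)
    if score.dim() != 1:
        raise ValueError(f"Scores must be 1-dimensional - got {score.dim()} for {score}")
    return score
```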
Why do we need a token_level_score? Maybe we can just assume score is token-level. Also, token_level_score doesn't appear to be tracked for each index, so if token_level_score=True at i and token_level_score=False at i+1, isn't that information lost?
dummy_scores = [
    torch.tensor([0, 0, 0, 0, 1], device=ppo_trainer.current_device),
    torch.tensor([0, 0, 0, 0, 0, 2], device=ppo_trainer.current_device),
]
Is this a breaking change requiring users to always pass in token level rewards?
@vwxyzjn I don't think this is a breaking change, but only because I'm assuming that no one is calling self.compute_rewards() outside of self.step(), as I couldn't see why that would be useful.
I set it up so you can pass either token-level or end-of-sequence rewards to step(), which then all get converted to token-level in _step_safety_checker(), meaning everything that gets sent to compute_rewards() would be always token-level.
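As a usage sketch (the query/response tensors and reward values below are made up for illustration, and the token-level form is the behaviour this PR proposes):

```python
# query_tensors / response_tensors: lists of token-id tensors prepared earlier;
# response lengths here are assumed to be 3 and 4.

# End-of-sequence rewards, one scalar per response (existing behaviour)
ppo_trainer.step(query_tensors, response_tensors, [torch.tensor(1.0), torch.tensor(0.5)])

# Token-level rewards, one value per response token (proposed in this PR)
ppo_trainer.step(
    query_tensors,
    response_tensors,
    [torch.tensor([0.0, 0.0, 1.0]), torch.tensor([0.0, 0.5, 0.0, 0.5])],
)
```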
trl/trainer/ppo_trainer.py (Outdated)
total_scores = torch.tensor([score[score != padding_value].sum() for score in padded_scores]).to(
    self.current_device
)
We should not only sum it up — the discount factor gamma also comes in. The total_scores should be the discounted episodic return. See https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
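A minimal sketch of that calculation (the gamma value and function name here are just for illustration):

```python
import torch

def discounted_return(rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    # Discounted episodic return: sum_t gamma^t * r_t over one reward sequence.
    discounts = gamma ** torch.arange(rewards.shape[0], dtype=rewards.dtype, device=rewards.device)
    return (discounts * rewards).sum()

# gamma = 1.0 recovers the plain sum in the snippet above
print(discounted_return(torch.tensor([0.0, 0.0, 1.0]), gamma=0.99))  # 0.99**2 * 1.0 = 0.9801
```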
@vwxyzjn ah yes good point, I've added that calculation
Some further comments
if score.dim() > 1:
# format scores to token-level scores if needed
for i, (score, response) in enumerate(zip(scores, responses)):
    # make score 1-dimensional
This comment seems outdated.
# sum to episodic rewards
rewards = [reward.sum() for reward in rewards]
This is not the discounted return
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Allow passing rewards to step that are List[torch.FloatTensor], each either of shape 1 (i.e. same as the existing functionality) or of shape (response_length,) so that we can give rewards to individual tokens. Important for more fine-grained feedback.