
Add whiten ops before compute advantages #887

Merged: 2 commits into huggingface:main on Oct 23, 2023
Conversation

@SingL3 (Contributor) commented Oct 18, 2023

1. From the LLaMA 2 paper:

   > We also find it important to whiten the final linear scores (shown here by reversing the sigmoid with the logit function) in order to increase stability and balance properly with the KL penalty term (β) above.

2. This function is taken from [alpaca_farm](https://github.com/tatsu-lab/alpaca_farm/blob/64e489c67ea502ab5fa944bebde3078c9722f6ee/src/alpaca_farm/rl/ppo_trainer.py#L86); a minimal sketch of the whitening op follows below.
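For reference, this is roughly what such a whitening op looks like: a minimal sketch modeled on the alpaca_farm function linked above, with illustrative names (`whiten`, `shift_mean`) and epsilon; the actual change in this PR may differ in details such as masking padding tokens.

```python
import torch

def whiten(values: torch.Tensor, shift_mean: bool = True) -> torch.Tensor:
    """Rescale `values` to unit variance; optionally restore the original mean."""
    mean, var = torch.mean(values), torch.var(values)
    whitened = (values - mean) * torch.rsqrt(var + 1e-8)
    if not shift_mean:
        whitened += mean  # keep the mean so only the scale of the rewards changes
    return whitened

# Illustrative use: whiten per-token rewards before computing advantages.
rewards = torch.randn(8, 128)                 # (batch, response_length)
rewards = whiten(rewards, shift_mean=False)   # preserve the mean, normalize the scale
```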

Lin Junpeng added 2 commits October 18, 2023 12:00

1. Add whiten ops before compute advantages
2. Fix type def of self
@younesbelkada (Contributor)

cc @vwxyzjn @lvwerra

@vwxyzjn (Contributor) commented Oct 19, 2023

@SingL3, thanks for the change! It makes sense. I have kick-started our benchmark to help understand its effect.


@vwxyzjn (Contributor) left a comment

No noticeable difference in performance with whiten_rewards=True. We could even set the default value to True. WDYT @lvwerra?
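For illustration, turning the option on would look roughly like this. This is a sketch that assumes the flag is exposed on the PPO config under the name `whiten_rewards` used in this comment, not a verbatim snippet from the PR; other fields and defaults may differ.

```python
from trl import PPOConfig

# Assumed flag name from the comment above; treat as illustrative, not the exact API.
config = PPOConfig(
    model_name="lvwerra/gpt2-imdb",
    whiten_rewards=True,  # whiten rewards before computing advantages
)
```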

**Plot commands:**

```bash
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
export TAGS_STRING='?tag=v0.4.7-137-g3f90b24&tag=pr-887'
export FOLDER_STRING='v0.4.7-137-g3f90b24_pr-887'
echo "we deal with $TAGS_STRING"

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/hello_world \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_gpt2xl_grad_accu$TAGS_STRING" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/different_models \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo_Cerebras-GPT-6.7B_grad_accu_deepspeed_stage2$TAGS_STRING" \
    --env-ids sentiment-analysis:cerebras/Cerebras-GPT-6.7B \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/deepspeed \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_step_grad_accu$TAGS_STRING" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/grad_accu \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_gpt2$TAGS_STRING" \
        "ppo_falcon_rw_1b$TAGS_STRING" \
        "ppo_peft$TAGS_STRING" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/more_different_models \
    --scan-history

python benchmark/upload_benchmark.py \
    --folder_path="benchmark/trl/$FOLDER_STRING" \
    --path_in_repo="images/benchmark/$FOLDER_STRING" \
    --repo_id="trl-internal-testing/example-images" \
    --repo_type="dataset"
```


```bash
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
TAGS_STRING='?tag=v0.4.7-137-g3f90b24&tag=pr-887'
FOLDER_STRING='pr-887_vs_baseline'
BASELINE_PR_TAG=v0.4.7-55-g110e672
BASELINE_PR_NAME=PR-662

echo "we deal with $TAGS_STRING"

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/hello_world \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_gpt2xl_grad_accu$TAGS_STRING" \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
        "sentiment_tuning_gpt2xl_grad_accu?tag=$BASELINE_PR_TAG&cl=sentiment gpt2xl ($BASELINE_PR_NAME)" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/different_models \
    --scan-history

python -m openrlbenchmark.rlops_multi_metrics \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "ppo$TAGS_STRING" \
        "ppo_gpt2$TAGS_STRING" \
        "ppo_falcon_rw_1b$TAGS_STRING" \
        "ppo_peft$TAGS_STRING" \
    --filters '?we=huggingface&wpn=trl&xaxis=_step&ceik=trl_ppo_trainer_config.value.reward_model&cen=trl_ppo_trainer_config.value.exp_name&metrics=env/reward_mean&metrics=objective/kl' \
        "sentiment_tuning?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb ($BASELINE_PR_NAME)" \
        "sentiment_tuning_gpt2?tag=$BASELINE_PR_TAG&cl=sentiment gpt2 ($BASELINE_PR_NAME)" \
        "sentiment_tuning_falcon_rw_1b?tag=$BASELINE_PR_TAG&cl=sentiment tiiuae/falcon-rw-1b ($BASELINE_PR_NAME)" \
        "sentiment_tuning_peft?tag=$BASELINE_PR_TAG&cl=sentiment lvwerra/gpt2-imdb w/ peft ($BASELINE_PR_NAME)" \
    --env-ids sentiment-analysis:lvwerra/distilbert-imdb \
    --no-check-empty-runs \
    --pc.ncols 2 \
    --pc.ncols-legend 1 \
    --output-filename benchmark/trl/$FOLDER_STRING/more_different_models \
    --scan-history

python benchmark/upload_benchmark.py \
    --folder_path="benchmark/trl/$FOLDER_STRING" \
    --path_in_repo="images/benchmark/$FOLDER_STRING" \
    --repo_id="trl-internal-testing/example-images" \
    --repo_type="dataset"
```

@vwxyzjn (Contributor) commented Oct 23, 2023

I am going to merge as is. Thanks so much @SingL3!

@vwxyzjn vwxyzjn merged commit 1f3314f into huggingface:main Oct 23, 2023
kashif pushed a commit to kashif/trl that referenced this pull request Oct 27, 2023
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024