
Commit f1e6377

Move PPOTrainer to trl.experimental.ppo (#4482)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
1 parent 01f497e commit f1e6377

16 files changed (+1037, -975 lines)

README.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ Explore how to seamlessly integrate TRL with OpenEnv in our [dedicated documenta
 
 ## Overview
 
-TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Built on top of the [🤗 Transformers](https://github.com/huggingface/transformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.
+TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Built on top of the [🤗 Transformers](https://github.com/huggingface/transformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.
 
 ## Highlights

docs/source/_toctree.yml

Lines changed: 2 additions & 2 deletions
@@ -66,8 +66,6 @@
   title: KTO
 - local: orpo_trainer
   title: ORPO
-- local: ppo_trainer
-  title: PPO
 - local: prm_trainer
   title: PRM
 - local: reward_trainer
@@ -119,6 +117,8 @@
   title: Nash-MD
 - local: papo_trainer
   title: PAPO
+- local: ppo_trainer
+  title: PPO
 - local: xpo_trainer
   title: XPO
 - local: openenv

docs/source/dataset_formats.md

Lines changed: 1 addition & 1 deletion
@@ -396,7 +396,7 @@ Choosing the right dataset type depends on the task you are working on and the s
 | [`experimental.nash_md.NashMDTrainer`] | [Prompt-only](#prompt-only) |
 | [`OnlineDPOTrainer`] | [Prompt-only](#prompt-only) |
 | [`ORPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
-| [`PPOTrainer`] | Tokenized language modeling |
+| [`experimental.ppo.PPOTrainer`] | Tokenized language modeling |
 | [`PRMTrainer`] | [Stepwise supervision](#stepwise-supervision) |
 | [`RewardTrainer`] | [Preference (implicit prompt recommended)](#preference) |
 | [`RLOOTrainer`] | [Prompt-only](#prompt-only) |

docs/source/example_overview.md

Lines changed: 3 additions & 3 deletions
@@ -37,7 +37,7 @@ These notebooks are easier to run and are designed for quick experimentation wit
 
 Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl/blob/main/trl/scripts) and [`examples/scripts`](https://github.com/huggingface/trl/blob/main/examples/scripts) directories. They show how to use different trainers such as `SFTTrainer`, `PPOTrainer`, `DPOTrainer`, `GRPOTrainer`, and more.
 
-File | Description |
+| File | Description |
 | --- | --- |
 | [`examples/scripts/bco.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/bco.py) | This script shows how to use the [`KTOTrainer`] with the BCO loss to fine-tune a model to increase instruction-following, truthfulness, honesty, and helpfulness using the [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset. |
 | [`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py) | This script shows how to use the [`experimental.cpo.CPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
@@ -55,8 +55,8 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
 | [`examples/scripts/online_dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/online_dpo.py) | This script shows how to use the [`OnlineDPOTrainer`] to fine-tune a model. |
 | [`examples/scripts/online_dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/online_dpo_vlm.py) | This script shows how to use the [`OnlineDPOTrainer`] to fine-tune a Vision Language Model. |
 | [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py) | This script shows how to use the [`ORPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
-| [`examples/scripts/ppo/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo.py) | This script shows how to use the [`PPOTrainer`] to fine-tune a model to improve its ability to continue text with positive sentiment or physically descriptive language. |
-| [`examples/scripts/ppo/ppo_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo_tldr.py) | This script shows how to use the [`PPOTrainer`] to fine-tune a model to improve its ability to generate TL;DR summaries. |
+| [`examples/scripts/ppo/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo.py) | This script shows how to use the [`experimental.ppo.PPOTrainer`] to fine-tune a model to improve its ability to continue text with positive sentiment or physically descriptive language. |
+| [`examples/scripts/ppo/ppo_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo_tldr.py) | This script shows how to use the [`experimental.ppo.PPOTrainer`] to fine-tune a model to improve its ability to generate TL;DR summaries. |
 | [`examples/scripts/prm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/prm.py) | This script shows how to use the [`PRMTrainer`] to fine-tune a Process-supervised Reward Model (PRM). |
 | [`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py) | This script shows how to use the [`RewardTrainer`] to train an Outcome Reward Model (ORM) on your own dataset. |
 | [`examples/scripts/rloo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/rloo.py) | This script shows how to use the [`RLOOTrainer`] to fine-tune a model to improve its ability to solve math questions. |

docs/source/index.md

Lines changed: 1 addition & 1 deletion
@@ -25,8 +25,8 @@ Below is the current list of TRL trainers, organized by method type (⚡️ = vL
 - [`GRPOTrainer`] ⚡️
 - [`RLOOTrainer`] ⚡️
 - [`OnlineDPOTrainer`] ⚡️
-- [`PPOTrainer`]
 - [`experimental.nash_md.NashMDTrainer`] 🧪 ⚡️
+- [`experimental.ppo.PPOTrainer`] 🧪
 - [`experimental.xpo.XPOTrainer`] 🧪 ⚡️
 
 ### Reward modeling

docs/source/peft_integration.md

Lines changed: 2 additions & 1 deletion
@@ -146,7 +146,8 @@ After training your reward adapter and pushing it to the Hub:
 
 ```python
 from peft import LoraConfig
-from trl import AutoModelForCausalLMWithValueHead, PPOTrainer
+from trl import AutoModelForCausalLMWithValueHead
+from trl.experimental.ppo import PPOTrainer
 
 model_name = "huggyllama/llama-7b"
 rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"

docs/source/ppo_trainer.md

Lines changed: 8 additions & 2 deletions
@@ -1,5 +1,11 @@
 # PPO Trainer
 
+<Tip warning={true}>
+
+**Deprecation Notice**: PPOTrainer and PPOConfig have been moved to `trl.experimental.ppo` and will be removed from `trl.trainer` in TRL 0.29.0. Please update your imports to use `from trl.experimental.ppo import PPOConfig, PPOTrainer` instead. See [issue #4466](https://github.com/huggingface/trl/issues/4466) for more information.
+
+</Tip>
+
 [![model badge](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo,trl)
 
 TRL supports training LLMs with [Proximal Policy Optimization (PPO)](https://huggingface.co/papers/1707.06347).
@@ -228,11 +234,11 @@ python -m openrlbenchmark.rlops_multi_metrics \
 
 ## PPOTrainer
 
-[[autodoc]] PPOTrainer
+[[autodoc]] experimental.ppo.PPOTrainer
     - train
     - save_model
     - push_to_hub
 
 ## PPOConfig
 
-[[autodoc]] PPOConfig
+[[autodoc]] experimental.ppo.PPOConfig
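
Downstream code that must run across this transition can guard the import. The snippet below is an illustrative compatibility sketch, not part of the commit; it assumes the old top-level import keeps working (with a deprecation warning) until its removal in TRL 0.29.0, as the notice above states.

```python
# Illustrative migration shim (not part of this commit): prefer the new
# experimental location introduced by this change, and fall back to the old
# top-level import on TRL versions that predate the move. The old path is
# slated for removal in TRL 0.29.0.
try:
    from trl.experimental.ppo import PPOConfig, PPOTrainer
except ImportError:  # older TRL releases
    from trl import PPOConfig, PPOTrainer
```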

docs/source/reducing_memory_usage.md

Lines changed: 1 addition & 1 deletion
@@ -274,7 +274,7 @@ training_args = OnlineDPOConfig(..., ds3_gather_for_generation=False)
 <hfoption id="PPO">
 
 ```python
-from trl import PPOConfig
+from trl.experimental.ppo import PPOConfig
 
 training_args = PPOConfig(..., ds3_gather_for_generation=False)
 ```
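
Only the import line changes here; the configuration call itself is untouched. A minimal runnable sketch of the updated snippet, assuming `PPOConfig` accepts the standard `output_dir` training argument:

```python
# Minimal sketch using the new import location. `ds3_gather_for_generation=False`
# keeps weights sharded under DeepSpeed ZeRO-3 during generation, trading
# generation speed for lower memory usage. `output_dir` is assumed to be the
# usual training-arguments field.
from trl.experimental.ppo import PPOConfig

training_args = PPOConfig(
    output_dir="ppo-model",
    ds3_gather_for_generation=False,
)
```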

examples/scripts/ppo/ppo.py

Lines changed: 2 additions & 9 deletions
@@ -34,15 +34,8 @@
     HfArgumentParser,
 )
 
-from trl import (
-    ModelConfig,
-    PPOConfig,
-    PPOTrainer,
-    ScriptArguments,
-    get_kbit_device_map,
-    get_peft_config,
-    get_quantization_config,
-)
+from trl import ModelConfig, ScriptArguments, get_kbit_device_map, get_peft_config, get_quantization_config
+from trl.experimental.ppo import PPOConfig, PPOTrainer
 
 
 # Enable logging in a Hugging Face Space

examples/scripts/ppo/ppo_tldr.py

Lines changed: 2 additions & 9 deletions
@@ -34,15 +34,8 @@
     HfArgumentParser,
 )
 
-from trl import (
-    ModelConfig,
-    PPOConfig,
-    PPOTrainer,
-    ScriptArguments,
-    get_kbit_device_map,
-    get_peft_config,
-    get_quantization_config,
-)
+from trl import ModelConfig, ScriptArguments, get_kbit_device_map, get_peft_config, get_quantization_config
+from trl.experimental.ppo import PPOConfig, PPOTrainer
 
 
 # Enable logging in a Hugging Face Space
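
Both example scripts feed these classes into transformers' `HfArgumentParser`. The following is an illustrative sketch of that pattern under the new import layout, not the verbatim script; the exact dataclass ordering and post-parsing logic in the real files may differ.

```python
# Illustrative sketch (assumed pattern, not the verbatim example script):
# parse the script, training, and model dataclasses from the command line,
# importing PPOConfig from its new experimental location.
from transformers import HfArgumentParser

from trl import ModelConfig, ScriptArguments
from trl.experimental.ppo import PPOConfig

if __name__ == "__main__":
    parser = HfArgumentParser((ScriptArguments, PPOConfig, ModelConfig))
    script_args, training_args, model_args = parser.parse_args_into_dataclasses()
    print(f"Training output will be written to {training_args.output_dir}")
```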
