Commit 18040a4

Address reviewer feedback on ORPO experimental migration
- Restore ORPO imports in trl/trainer/__init__.py for backward compatibility
- Fix deprecation stub naming from ExperimentalORPOTrainer to _ORPOTrainer
- Add torch import to deprecation stub for type hints
- Fix relative import paths in trl/experimental/orpo/orpo_trainer.py
- Update autodoc references to experimental.orpo.ORPOTrainer
- Update all documentation references to use experimental namespace
- Move ORPO test from test_trainers_args.py to experimental/test_trainers_args.py
1 parent 2d2306a commit 18040a4

11 files changed: +45 -41 lines changed

docs/source/community_tutorials.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ Community tutorials are made by active members of the Hugging Face community who
 | Instruction tuning | [`SFTTrainer`] | Fine-tuning Google Gemma LLMs using ChatML format with QLoRA | [Philipp Schmid](https://huggingface.co/philschmid) | [Link](https://www.philschmid.de/fine-tune-google-gemma) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/deep-learning-pytorch-huggingface/blob/main/training/gemma-lora-example.ipynb) |
 | Structured Generation | [`SFTTrainer`] | Fine-tuning Llama-2-7B to generate Persian product catalogs in JSON using QLoRA and PEFT | [Mohammadreza Esmaeilian](https://huggingface.co/Mohammadreza) | [Link](https://huggingface.co/learn/cookbook/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_llm_to_generate_persian_product_catalogs_in_json_format.ipynb) |
 | Preference Optimization | [`DPOTrainer`] | Align Mistral-7b using Direct Preference Optimization for human preference alignment | [Maxime Labonne](https://huggingface.co/mlabonne) | [Link](https://mlabonne.github.io/blog/posts/Fine_tune_Mistral_7b_with_DPO.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mlabonne/llm-course/blob/main/Fine_tune_a_Mistral_7b_model_with_DPO.ipynb) |
-| Preference Optimization | [`ORPOTrainer`] | Fine-tuning Llama 3 with ORPO combining instruction tuning and preference alignment | [Maxime Labonne](https://huggingface.co/mlabonne) | [Link](https://mlabonne.github.io/blog/posts/2024-04-19_Fine_tune_Llama_3_with_ORPO.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eHNWg9gnaXErdAa8_mcvjMupbSS6rDvi) |
+| Preference Optimization | [`experimental.orpo.ORPOTrainer`] | Fine-tuning Llama 3 with ORPO combining instruction tuning and preference alignment | [Maxime Labonne](https://huggingface.co/mlabonne) | [Link](https://mlabonne.github.io/blog/posts/2024-04-19_Fine_tune_Llama_3_with_ORPO.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eHNWg9gnaXErdAa8_mcvjMupbSS6rDvi) |
 | Instruction tuning | [`SFTTrainer`] | How to fine-tune open LLMs in 2025 with Hugging Face | [Philipp Schmid](https://huggingface.co/philschmid) | [Link](https://www.philschmid.de/fine-tune-llms-in-2025) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fine-tune-llms-in-2025.ipynb) |

 ### Videos

docs/source/dataset_formats.md

Lines changed: 1 addition & 1 deletion
@@ -395,7 +395,7 @@ Choosing the right dataset type depends on the task you are working on and the s
 | [`KTOTrainer`] | [Unpaired preference](#unpaired-preference) or [Preference (explicit prompt recommended)](#preference) |
 | [`NashMDTrainer`] | [Prompt-only](#prompt-only) |
 | [`OnlineDPOTrainer`] | [Prompt-only](#prompt-only) |
-| [`ORPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
+| [`experimental.orpo.ORPOTrainer`] | [Preference (explicit prompt recommended)](#preference) |
 | [`PPOTrainer`] | Tokenized language modeling |
 | [`PRMTrainer`] | [Stepwise supervision](#stepwise-supervision) |
 | [`RewardTrainer`] | [Preference (implicit prompt recommended)](#preference) |

docs/source/example_overview.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
 | [`examples/scripts/nash_md.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nash_md.py) | This script shows how to use the [`NashMDTrainer`] to fine-tune a model. |
 | [`examples/scripts/online_dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/online_dpo.py) | This script shows how to use the [`OnlineDPOTrainer`] to fine-tune a model. |
 | [`examples/scripts/online_dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/online_dpo_vlm.py) | This script shows how to use the [`OnlineDPOTrainer`] to fine-tune a a Vision Language Model. |
-| [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py) | This script shows how to use the [`ORPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
+| [`examples/scripts/orpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/orpo.py) | This script shows how to use the [`experimental.orpo.ORPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
 | [`examples/scripts/ppo/ppo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo.py) | This script shows how to use the [`PPOTrainer`] to fine-tune a model to improve its ability to continue text with positive sentiment or physically descriptive language. |
 | [`examples/scripts/ppo/ppo_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo_tldr.py) | This script shows how to use the [`PPOTrainer`] to fine-tune a model to improve its ability to generate TL;DR summaries. |
 | [`examples/scripts/prm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/prm.py) | This script shows how to use the [`PRMTrainer`] to fine-tune a Process-supervised Reward Model (PRM). |

docs/source/index.md

Lines changed: 1 addition & 1 deletion
@@ -41,8 +41,8 @@ Below is the current list of TRL trainers, organized by method type (⚡️ = vL

 - [`SFTTrainer`]
 - [`DPOTrainer`]
-- [`ORPOTrainer`]
 - [`experimental.bco.BCOTrainer`] 🧪
+- [`experimental.orpo.ORPOTrainer`] 🧪
 - [`CPOTrainer`]
 - [`KTOTrainer`]

docs/source/orpo_trainer.md

Lines changed: 4 additions & 4 deletions
@@ -79,9 +79,9 @@ Here are some other factors to consider when choosing a programming language for

 ## Expected dataset type

-ORPO requires a [preference dataset](dataset_formats#preference). The [`ORPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
+ORPO requires a [preference dataset](dataset_formats#preference). The [`experimental.orpo.ORPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.

-Although the [`ORPOTrainer`] supports both explicit and implicit prompts, we recommend using explicit prompts. If provided with an implicit prompt dataset, the trainer will automatically extract the prompt from the `"chosen"` and `"rejected"` columns. For more information, refer to the [preference style](dataset_formats#preference) section.
+Although the [`experimental.orpo.ORPOTrainer`] supports both explicit and implicit prompts, we recommend using explicit prompts. If provided with an implicit prompt dataset, the trainer will automatically extract the prompt from the `"chosen"` and `"rejected"` columns. For more information, refer to the [preference style](dataset_formats#preference) section.

 ## Example script

@@ -121,11 +121,11 @@ While training and evaluating, we record the following reward metrics:

 ## ORPOTrainer

-[[autodoc]] ORPOTrainer
+[[autodoc]] experimental.orpo.ORPOTrainer
     - train
     - save_model
     - push_to_hub

 ## ORPOConfig

-[[autodoc]] ORPOConfig
+[[autodoc]] experimental.orpo.ORPOConfig
File renamed without changes.

tests/experimental/test_trainers_args.py

Lines changed: 28 additions & 0 deletions
@@ -16,6 +16,7 @@
 from transformers import AutoTokenizer

 from trl.experimental.bco import BCOConfig, BCOTrainer
+from trl.experimental.orpo import ORPOConfig, ORPOTrainer

 from ..testing_utils import TrlTestCase, require_sklearn

@@ -68,3 +69,30 @@ def test_bco(self):
         assert trainer.args.prompt_sample_size == 512
         assert trainer.args.min_density_ratio == 0.2
         assert trainer.args.max_density_ratio == 20.0
+
+    def test_orpo(self):
+        model_id = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
+        tokenizer = AutoTokenizer.from_pretrained(model_id)
+        dataset = load_dataset("trl-internal-testing/zen", "standard_preference", split="train")
+        training_args = ORPOConfig(
+            self.tmp_dir,
+            max_length=256,
+            max_prompt_length=64,
+            max_completion_length=64,
+            beta=0.5,
+            disable_dropout=False,
+            label_pad_token_id=-99,
+            padding_value=-99,
+            truncation_mode="keep_start",
+            # generate_during_eval=True, # ignore this one, it requires wandb
+            is_encoder_decoder=True,
+            model_init_kwargs={"trust_remote_code": True},
+            dataset_num_proc=4,
+        )
+        trainer = ORPOTrainer(model=model_id, args=training_args, train_dataset=dataset, processing_class=tokenizer)
+        assert trainer.args.max_length == 256
+        assert trainer.args.max_prompt_length == 64
+        assert trainer.args.max_completion_length == 64
+        assert trainer.args.beta == 0.5
+        assert not trainer.args.disable_dropout
+        assert trainer.args.label_pad_token_id == -99

tests/test_trainers_args.py

Lines changed: 0 additions & 29 deletions
@@ -28,8 +28,6 @@
     NashMDTrainer,
     OnlineDPOConfig,
     OnlineDPOTrainer,
-    ORPOConfig,
-    ORPOTrainer,
     RewardConfig,
     RewardTrainer,
     SFTConfig,
@@ -248,33 +246,6 @@ def test_online_dpo(self, beta_list):
         assert trainer.args.beta == (0.6 if not beta_list else [0.6, 0.7])
         assert trainer.args.loss_type == "hinge"

-    def test_orpo(self):
-        model_id = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
-        tokenizer = AutoTokenizer.from_pretrained(model_id)
-        dataset = load_dataset("trl-internal-testing/zen", "standard_preference", split="train")
-        training_args = ORPOConfig(
-            self.tmp_dir,
-            max_length=256,
-            max_prompt_length=64,
-            max_completion_length=64,
-            beta=0.5,
-            disable_dropout=False,
-            label_pad_token_id=-99,
-            padding_value=-99,
-            truncation_mode="keep_start",
-            # generate_during_eval=True, # ignore this one, it requires wandb
-            is_encoder_decoder=True,
-            model_init_kwargs={"trust_remote_code": True},
-            dataset_num_proc=4,
-        )
-        trainer = ORPOTrainer(model=model_id, args=training_args, train_dataset=dataset, processing_class=tokenizer)
-        assert trainer.args.max_length == 256
-        assert trainer.args.max_prompt_length == 64
-        assert trainer.args.max_completion_length == 64
-        assert trainer.args.beta == 0.5
-        assert not trainer.args.disable_dropout
-        assert trainer.args.label_pad_token_id == -99
-
     def test_reward(self):
         model_id = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
         tokenizer = AutoTokenizer.from_pretrained(model_id)

trl/experimental/orpo/orpo_trainer.py

Lines changed: 2 additions & 2 deletions
@@ -49,9 +49,9 @@
 from transformers.utils import is_peft_available, is_torch_fx_proxy

 from ...data_utils import maybe_apply_chat_template, maybe_extract_prompt
-from ..base_trainer import BaseTrainer
+from ...trainer.base_trainer import BaseTrainer
 from .orpo_config import ORPOConfig
-from ..utils import (
+from ...trainer.utils import (
     DPODataCollatorWithPadding,
     add_bos_token_if_needed,
     add_eos_token_if_needed,
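
The relative-import fix above can be checked without installing anything. This sketch assumes the module's package is `trl.experimental.orpo`, as the file path suggests, and uses the standard library's `importlib.util.resolve_name`, which only computes the absolute name and does not import the target:

```python
from importlib.util import resolve_name

# Assumed __package__ of trl/experimental/orpo/orpo_trainer.py
package = "trl.experimental.orpo"

# Before the fix: two dots climb only as far as trl.experimental.
old_target = resolve_name("..base_trainer", package)

# After the fix: three dots climb to trl, then descend into trainer.
new_target = resolve_name("...trainer.base_trainer", package)

print(old_target)  # trl.experimental.base_trainer
print(new_target)  # trl.trainer.base_trainer
```

Each leading dot beyond the first moves one package level up, so a file that moved one directory deeper needs one extra dot plus the explicit `trainer` segment to reach the modules it used to sit beside.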

trl/trainer/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -54,6 +54,8 @@
     "nash_md_trainer": ["NashMDTrainer"],
     "online_dpo_config": ["OnlineDPOConfig"],
     "online_dpo_trainer": ["OnlineDPOTrainer"],
+    "orpo_config": ["ORPOConfig"],
+    "orpo_trainer": ["ORPOTrainer"],
     "ppo_config": ["PPOConfig"],
     "ppo_trainer": ["PPOTrainer"],
     "prm_config": ["PRMConfig"],
@@ -112,6 +114,8 @@
     from .nash_md_trainer import NashMDTrainer
     from .online_dpo_config import OnlineDPOConfig
     from .online_dpo_trainer import OnlineDPOTrainer
+    from .orpo_config import ORPOConfig
+    from .orpo_trainer import ORPOTrainer
     from .ppo_config import PPOConfig
     from .ppo_trainer import PPOTrainer
     from .prm_config import PRMConfig
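
The two hunks above register the same names twice: once in a mapping from submodule to exported names, and once as direct imports. The following is only an illustrative sketch of why the mapping entries matter, assuming (this diff does not show it) that the mapping drives a lazy attribute lookup; `import_structure` and `find_submodule` are hypothetical names, not TRL's API:

```python
# Hypothetical excerpt of a submodule -> exported-names mapping,
# mirroring the entries restored in this commit.
import_structure = {
    "online_dpo_trainer": ["OnlineDPOTrainer"],
    "orpo_config": ["ORPOConfig"],
    "orpo_trainer": ["ORPOTrainer"],
}


def find_submodule(name, structure):
    """Return the submodule that exports `name`, or None if it is not registered."""
    for submodule, exported in structure.items():
        if name in exported:
            return submodule
    return None


# With the entries present, the old public name maps to a submodule to load.
with_orpo = find_submodule("ORPOTrainer", import_structure)

# Without them, a lookup of the old name has nothing to resolve to.
stripped = {k: v for k, v in import_structure.items() if not k.startswith("orpo")}
without_orpo = find_submodule("ORPOTrainer", stripped)
```

Under that assumption, dropping the two mapping entries would make the old import path fail even though the stub module still exists on disk, which matches the commit's stated goal of restoring backward compatibility.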
