
Commit 64cfca4

Move judges to experimental submodule (#4439)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
1 parent 97ca1a2 commit 64cfca4

File tree: 20 files changed (+625, -491 lines)


docs/source/_toctree.yml

Lines changed: 2 additions & 2 deletions

@@ -87,8 +87,6 @@
     title: Model Classes
   - local: model_utils
     title: Model Utilities
-  - local: judges
-    title: Judges
   - local: callbacks
     title: Callbacks
   - local: data_utils
@@ -115,6 +113,8 @@
     title: GRPO With Replay Buffer
   - local: gspo_token
     title: GSPO-token
+  - local: judges
+    title: Judges
   - local: papo_trainer
     title: PAPO
   - local: xpo_trainer

docs/source/example_overview.md

Lines changed: 1 addition & 1 deletion

@@ -43,7 +43,7 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
 | [`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py) | This script shows how to use the [`CPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
 | [`trl/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/dpo.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a model. |
 | [`examples/scripts/dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_vlm.py) | This script shows how to use the [`DPOTrainer`] to fine-tune a Vision Language Model to reduce hallucinations using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset) dataset. |
-| [`examples/scripts/evals/judge_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/evals/judge_tldr.py) | This script shows how to use [`HfPairwiseJudge`] or [`OpenAIPairwiseJudge`] to judge model generations. |
+| [`examples/scripts/evals/judge_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/evals/judge_tldr.py) | This script shows how to use [`HfPairwiseJudge`] or [`experimental.judges.OpenAIPairwiseJudge`] to judge model generations. |
 | [`examples/scripts/gkd.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/gkd.py) | This script shows how to use the [`GKDTrainer`] to fine-tune a model. |
 | [`trl/scripts/grpo.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/grpo.py) | This script shows how to use the [`GRPOTrainer`] to fine-tune a model. |
 | [`examples/scripts/grpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/grpo_vlm.py) | This script shows how to use the [`GRPOTrainer`] to fine-tune a multimodal model for reasoning using the [lmms-lab/multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) dataset. |

docs/source/judges.md

Lines changed: 13 additions & 13 deletions

@@ -1,7 +1,7 @@
 # Judges
 
 > [!WARNING]
-> TRL Judges is an experimental API which is subject to change at any time.
+> TRL Judges is an experimental API which is subject to change at any time. As of TRL v1.0, judges have been moved to the `trl.experimental.judges` module.
 
 TRL provides judges to easily compare two completions.
 
@@ -13,10 +13,10 @@ pip install trl[judges]
 
 ## Using the provided judges
 
-TRL provides several judges out of the box. For example, you can use the [`HfPairwiseJudge`] to compare two completions using a pre-trained model from the Hugging Face model hub:
+TRL provides several judges out of the box. For example, you can use the [`experimental.judges.HfPairwiseJudge`] to compare two completions using a pre-trained model from the Hugging Face model hub:
 
 ```python
-from trl import HfPairwiseJudge
+from trl.experimental.judges import HfPairwiseJudge
 
 judge = HfPairwiseJudge()
 judge.judge(
@@ -27,12 +27,12 @@ judge.judge(
 
 ## Define your own judge
 
-To define your own judge, we provide several base classes that you can subclass. For rank-based judges, you need to subclass [`BaseRankJudge`] and implement the [`BaseRankJudge.judge`] method. For pairwise judges, you need to subclass [`BasePairwiseJudge`] and implement the [`BasePairwiseJudge.judge`] method. If you want to define a judge that doesn't fit into these categories, you need to subclass [`BaseJudge`] and implement the [`BaseJudge.judge`] method.
+To define your own judge, we provide several base classes that you can subclass. For rank-based judges, you need to subclass [`experimental.judges.BaseRankJudge`] and implement the [`experimental.judges.BaseRankJudge.judge`] method. For pairwise judges, you need to subclass [`experimental.judges.BasePairwiseJudge`] and implement the [`experimental.judges.BasePairwiseJudge.judge`] method. If you want to define a judge that doesn't fit into these categories, you need to subclass [`experimental.judges.BaseJudge`] and implement the [`experimental.judges.BaseJudge.judge`] method.
 
 As an example, let's define a pairwise judge that prefers shorter completions:
 
 ```python
-from trl import BasePairwiseJudge
+from trl.experimental.judges import BasePairwiseJudge
 
 class PrefersShorterJudge(BasePairwiseJudge):
     def judge(self, prompts, completions, shuffle_order=False):
@@ -53,34 +53,34 @@ judge.judge(
 
 ### PairRMJudge
 
-[[autodoc]] PairRMJudge
+[[autodoc]] trl.experimental.judges.PairRMJudge
 
 ### HfPairwiseJudge
 
-[[autodoc]] HfPairwiseJudge
+[[autodoc]] trl.experimental.judges.HfPairwiseJudge
 
 ### OpenAIPairwiseJudge
 
-[[autodoc]] OpenAIPairwiseJudge
+[[autodoc]] trl.experimental.judges.OpenAIPairwiseJudge
 
 ### AllTrueJudge
 
-[[autodoc]] AllTrueJudge
+[[autodoc]] trl.experimental.judges.AllTrueJudge
 
 ## Base classes
 
 ### BaseJudge
 
-[[autodoc]] BaseJudge
+[[autodoc]] trl.experimental.judges.BaseJudge
 
 ### BaseBinaryJudge
 
-[[autodoc]] BaseBinaryJudge
+[[autodoc]] trl.experimental.judges.BaseBinaryJudge
 
 ### BaseRankJudge
 
-[[autodoc]] BaseRankJudge
+[[autodoc]] trl.experimental.judges.BaseRankJudge
 
 ### BasePairwiseJudge
 
-[[autodoc]] BasePairwiseJudge
+[[autodoc]] trl.experimental.judges.BasePairwiseJudge
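
For reference, a minimal, self-contained sketch of the relocated custom-judge example using the new import path. The judge body and the sample call are illustrative; only the import path, class name, and method signature come from the diff above, and the pairwise judge is assumed to return the index of the preferred completion for each prompt, as described in the judges documentation.

```python
from trl.experimental.judges import BasePairwiseJudge


class PrefersShorterJudge(BasePairwiseJudge):
    """Toy pairwise judge that always prefers the shorter of two completions."""

    def judge(self, prompts, completions, shuffle_order=False):
        # For each prompt, completions[i] is a pair; return the index of the preferred one.
        return [0 if len(pair[0]) <= len(pair[1]) else 1 for pair in completions]


judge = PrefersShorterJudge()
print(judge.judge(
    ["What is the capital of France?"],
    [["Paris is the capital of France and a major European city.", "Paris."]],
))  # -> [1]: the shorter completion wins
```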

docs/source/nash_md_trainer.md

Lines changed: 4 additions & 3 deletions

@@ -14,7 +14,7 @@ This post-training method was contributed by [Kashif Rasul](https://huggingface.
 
 ## Quick start
 
-This example demonstrates how to train a model using the Nash-MD method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
+This example demonstrates how to train a model using the Nash-MD method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`experimental.judges.PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
 
 <iframe
   src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
@@ -28,7 +28,8 @@ Below is the script to train the model:
 ```python
 # train_nash_md.py
 from datasets import load_dataset
-from trl import NashMDConfig, NashMDTrainer, PairRMJudge
+from trl import NashMDConfig, NashMDTrainer
+from trl.experimental.judges import PairRMJudge
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
@@ -72,7 +73,7 @@ Nash-MD requires a [prompt-only dataset](dataset_formats#prompt-only). The [`Nas
 Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:
 
 ```diff
-- from trl import PairRMJudge
+- from trl.experimental.judges import PairRMJudge
 + from transformers import AutoModelForSequenceClassification
 
 - judge = PairRMJudge()
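
The hunk above only shows the imports and model loading of the quick start. A minimal sketch of how the rest fits together with the new import path; the trainer keyword arguments (`judge`, `args`, `processing_class`, `train_dataset`), the `trl-lib/ultrafeedback-prompt` split, and the `output_dir` value are assumptions based on the surrounding docs, not shown in this diff.

```python
# Sketch only: argument names are assumed from the surrounding docs, not from this diff.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import NashMDConfig, NashMDTrainer
from trl.experimental.judges import PairRMJudge

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()  # pairwise judge used to score on-policy completions
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = NashMDConfig(output_dir="Qwen2-0.5B-NashMD")  # output_dir is illustrative
trainer = NashMDTrainer(
    model=model,
    judge=judge,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```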

docs/source/online_dpo_trainer.md

Lines changed: 4 additions & 3 deletions

@@ -14,7 +14,7 @@ This post-training method was contributed by [Michael Noukhovitch](https://huggi
 
 ## Quick start
 
-This example demonstrates how to train a model using the online DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
+This example demonstrates how to train a model using the online DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`experimental.judges.PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
 
 <iframe
   src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
@@ -28,7 +28,8 @@ Below is the script to train the model:
 ```python
 # train_online_dpo.py
 from datasets import load_dataset
-from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge
+from trl import OnlineDPOConfig, OnlineDPOTrainer
+from trl.experimental.judges import PairRMJudge
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
@@ -74,7 +75,7 @@ Online DPO only requires a [prompt-only dataset](dataset_formats#prompt-only) (u
 Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:
 
 ```diff
-- from trl import PairRMJudge
+- from trl.experimental.judges import PairRMJudge
 + from transformers import AutoModelForSequenceClassification
 
 - judge = PairRMJudge()
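
The inner `diff` block above stops at removing the judge. A hedged sketch of the replacement side using only standard `transformers` calls; the variable names are illustrative, and how the reward model is then passed to the trainer is left as a comment because the exact keyword is not shown in this hunk.

```python
# Sketch of swapping the judge for a reward model (variable names are illustrative).
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_model = AutoModelForSequenceClassification.from_pretrained(
    "trl-lib/Qwen2-0.5B-Reward", num_labels=1
)
reward_tokenizer = AutoTokenizer.from_pretrained("trl-lib/Qwen2-0.5B-Reward")

# The trainer is then constructed with the reward model instead of `judge=...`;
# see the full online_dpo_trainer.md page for the exact keyword arguments.
```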

docs/source/xpo_trainer.md

Lines changed: 3 additions & 3 deletions

@@ -17,7 +17,7 @@ This post-training method was contributed by [Kashif Rasul](https://huggingface.
 
 ## Quick start
 
-This example demonstrates how to train a model using the XPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
+This example demonstrates how to train a model using the XPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`experimental.judges.PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
 <iframe
   src="https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt/embed/viewer/default/train?row=0"
   frameborder="0"
@@ -30,7 +30,7 @@ Below is the script to train the model:
 ```python
 # train_xpo.py
 from datasets import load_dataset
-from trl import PairRMJudge
+from trl.experimental.judges import PairRMJudge
 from trl.experimental.xpo import XPOConfig, XPOTrainer
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
@@ -75,7 +75,7 @@ XPO requires a [prompt-only dataset](dataset_formats#prompt-only). The [`experim
 Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:
 
 ```diff
-- from trl import PairRMJudge
+- from trl.experimental.judges import PairRMJudge
 + from transformers import AutoModelForSequenceClassification
 
 - judge = PairRMJudge()
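
Because this commit moves the public import path, downstream scripts that need to run against both pre-move and post-move TRL releases can guard the import. A small sketch; the fallback branch mirrors the old `from trl import ...` path that this commit removes throughout.

```python
# Import judges from the new experimental location, falling back to the old
# top-level path for TRL releases that predate this change.
try:
    from trl.experimental.judges import HfPairwiseJudge, OpenAIPairwiseJudge, PairRMJudge
except ImportError:
    from trl import HfPairwiseJudge, OpenAIPairwiseJudge, PairRMJudge
```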

examples/scripts/evals/judge_tldr.py

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@
 from transformers import HfArgumentParser
 from vllm import LLM, SamplingParams
 
-from trl import HfPairwiseJudge, OpenAIPairwiseJudge
+from trl.experimental.judges import HfPairwiseJudge, OpenAIPairwiseJudge
 
 
 """

examples/scripts/nash_md.py

Lines changed: 1 addition & 3 deletions

@@ -61,18 +61,16 @@
 from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer, GenerationConfig
 
 from trl import (
-    HfPairwiseJudge,
     LogCompletionsCallback,
     ModelConfig,
     NashMDConfig,
     NashMDTrainer,
-    OpenAIPairwiseJudge,
-    PairRMJudge,
     ScriptArguments,
     TrlParser,
     get_kbit_device_map,
     get_quantization_config,
 )
+from trl.experimental.judges import HfPairwiseJudge, OpenAIPairwiseJudge, PairRMJudge
 
 
 # Enable logging in a Hugging Face Space

examples/scripts/online_dpo.py

Lines changed: 1 addition & 3 deletions

@@ -56,19 +56,17 @@
 from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer, GenerationConfig
 
 from trl import (
-    HfPairwiseJudge,
     LogCompletionsCallback,
     ModelConfig,
     OnlineDPOConfig,
     OnlineDPOTrainer,
-    OpenAIPairwiseJudge,
-    PairRMJudge,
     ScriptArguments,
     TrlParser,
     get_kbit_device_map,
     get_peft_config,
     get_quantization_config,
 )
+from trl.experimental.judges import HfPairwiseJudge, OpenAIPairwiseJudge, PairRMJudge
 
 
 # Enable logging in a Hugging Face Space

examples/scripts/xpo.py

Lines changed: 1 addition & 3 deletions

@@ -45,16 +45,14 @@
 from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer, GenerationConfig
 
 from trl import (
-    HfPairwiseJudge,
     LogCompletionsCallback,
     ModelConfig,
-    OpenAIPairwiseJudge,
-    PairRMJudge,
     ScriptArguments,
     TrlParser,
     get_kbit_device_map,
     get_quantization_config,
 )
+from trl.experimental.judges import HfPairwiseJudge, OpenAIPairwiseJudge, PairRMJudge
 from trl.experimental.xpo import XPOConfig, XPOTrainer
 
 