docs/source/example_overview.md (1 addition & 1 deletion)
@@ -43,7 +43,7 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
 |[`examples/scripts/cpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/cpo.py)| This script shows how to use the [`CPOTrainer`] to fine-tune a model to increase helpfulness and harmlessness using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset. |
 |[`trl/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/dpo.py)| This script shows how to use the [`DPOTrainer`] to fine-tune a model. |
 |[`examples/scripts/dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_vlm.py)| This script shows how to use the [`DPOTrainer`] to fine-tune a Vision Language Model to reduce hallucinations using the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset) dataset. |
-|[`examples/scripts/evals/judge_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/evals/judge_tldr.py)| This script shows how to use [`HfPairwiseJudge`] or [`OpenAIPairwiseJudge`] to judge model generations. |
+|[`examples/scripts/evals/judge_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/evals/judge_tldr.py)| This script shows how to use [`HfPairwiseJudge`] or [`experimental.judges.OpenAIPairwiseJudge`] to judge model generations. |
 |[`examples/scripts/gkd.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/gkd.py)| This script shows how to use the [`GKDTrainer`] to fine-tune a model. |
 |[`trl/scripts/grpo.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/grpo.py)| This script shows how to use the [`GRPOTrainer`] to fine-tune a model. |
 |[`examples/scripts/grpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/grpo_vlm.py)| This script shows how to use the [`GRPOTrainer`] to fine-tune a multimodal model for reasoning using the [lmms-lab/multimodal-open-r1-8k-verified](https://huggingface.co/datasets/lmms-lab/multimodal-open-r1-8k-verified) dataset. |

docs/source/judges.md (13 additions & 13 deletions)
@@ -1,7 +1,7 @@
 # Judges

 > [!WARNING]
-> TRL Judges is an experimental API which is subject to change at any time.
+> TRL Judges is an experimental API which is subject to change at any time. As of TRL v1.0, judges have been moved to the `trl.experimental.judges` module.

 TRL provides judges to easily compare two completions.

@@ -13,10 +13,10 @@ pip install trl[judges]

 ## Using the provided judges

-TRL provides several judges out of the box. For example, you can use the [`HfPairwiseJudge`] to compare two completions using a pre-trained model from the Hugging Face model hub:
+TRL provides several judges out of the box. For example, you can use the [`experimental.judges.HfPairwiseJudge`] to compare two completions using a pre-trained model from the Hugging Face model hub:

 ```python
-from trl import HfPairwiseJudge
+from trl.experimental.judges import HfPairwiseJudge

 judge = HfPairwiseJudge()
 judge.judge(
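The hunk above cuts off inside the `judge.judge(` call. As a rough sketch (the prompts and completions below are illustrative placeholders, not taken from the file), the updated usage amounts to:

```python
from trl.experimental.judges import HfPairwiseJudge

judge = HfPairwiseJudge()  # instantiates the judge with its default backing model
ranks = judge.judge(
    prompts=["What is the capital of France?"],
    completions=[["Paris", "Lyon"]],
)
print(ranks)  # one index per prompt: 0 if the first completion is preferred, 1 otherwise
```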
@@ -27,12 +27,12 @@ judge.judge(

 ## Define your own judge

-To define your own judge, we provide several base classes that you can subclass. For rank-based judges, you need to subclass [`BaseRankJudge`] and implement the [`BaseRankJudge.judge`] method. For pairwise judges, you need to subclass [`BasePairwiseJudge`] and implement the [`BasePairwiseJudge.judge`] method. If you want to define a judge that doesn't fit into these categories, you need to subclass [`BaseJudge`] and implement the [`BaseJudge.judge`] method.
+To define your own judge, we provide several base classes that you can subclass. For rank-based judges, you need to subclass [`experimental.judges.BaseRankJudge`] and implement the [`experimental.judges.BaseRankJudge.judge`] method. For pairwise judges, you need to subclass [`experimental.judges.BasePairwiseJudge`] and implement the [`experimental.judges.BasePairwiseJudge.judge`] method. If you want to define a judge that doesn't fit into these categories, you need to subclass [`experimental.judges.BaseJudge`] and implement the [`experimental.judges.BaseJudge.judge`] method.

 As an example, let's define a pairwise judge that prefers shorter completions:

 ```python
-from trl import BasePairwiseJudge
+from trl.experimental.judges import BasePairwiseJudge
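As a rough sketch of the "prefers shorter completions" example under the new module path (the class name and body are illustrative, assuming the `judge(prompts, completions, shuffle_order=True)` interface described above):

```python
from trl.experimental.judges import BasePairwiseJudge


class PrefersShorterJudge(BasePairwiseJudge):
    """Toy pairwise judge that always prefers the shorter completion."""

    def judge(self, prompts, completions, shuffle_order=True):
        # For each prompt, return 0 if the first completion is shorter, else 1.
        return [0 if len(pair[0]) <= len(pair[1]) else 1 for pair in completions]
```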

docs/source/nash_md_trainer.md (4 additions & 3 deletions)
@@ -14,7 +14,7 @@ This post-training method was contributed by [Kashif Rasul](https://huggingface.

 ## Quick start

-This example demonstrates how to train a model using the Nash-MD method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
+This example demonstrates how to train a model using the Nash-MD method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`experimental.judges.PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
@@ -28,7 +28,8 @@ Below is the script to train the model:
 ```python
 # train_nash_md.py
 from datasets import load_dataset
-from trl import NashMDConfig, NashMDTrainer, PairRMJudge
+from trl import NashMDConfig, NashMDTrainer
+from trl.experimental.judges import PairRMJudge
 from transformers import AutoModelForCausalLM, AutoTokenizer

 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
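The hunk stops at the model load. Below is a sketch of how the rest of the quick-start script typically continues once the imports are split this way; the dataset id, output directory, and the exact `NashMDTrainer` keyword arguments are assumptions for illustration, not part of this diff.

```python
# Sketch only: continues the quick-start script above with assumed settings.
from datasets import load_dataset
from trl import NashMDConfig, NashMDTrainer
from trl.experimental.judges import PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()  # assumed: typically requires the optional llm-blender package
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")  # assumed prompt-only dataset

training_args = NashMDConfig(output_dir="Qwen2-0.5B-NashMD")  # assumed output directory
trainer = NashMDTrainer(
    model=model,
    judge=judge,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```

Such a script is typically launched with `accelerate launch train_nash_md.py`.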
@@ -72,7 +73,7 @@ Nash-MD requires a [prompt-only dataset](dataset_formats#prompt-only). The [`Nas
 Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:

 ```diff
-- from trl import PairRMJudge
+- from trl.experimental.judges import PairRMJudge
 + from transformers import AutoModelForSequenceClassification

docs/source/online_dpo_trainer.md (4 additions & 3 deletions)
@@ -14,7 +14,7 @@ This post-training method was contributed by [Michael Noukhovitch](https://huggi

 ## Quick start

-This example demonstrates how to train a model using the online DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
+This example demonstrates how to train a model using the online DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`experimental.judges.PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
@@ -28,7 +28,8 @@ Below is the script to train the model:
 ```python
 # train_online_dpo.py
 from datasets import load_dataset
-from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge
+from trl import OnlineDPOConfig, OnlineDPOTrainer
+from trl.experimental.judges import PairRMJudge
 from transformers import AutoModelForCausalLM, AutoTokenizer

 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
@@ -74,7 +75,7 @@ Online DPO only requires a [prompt-only dataset](dataset_formats#prompt-only) (u
 Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:

 ```diff
-- from trl import PairRMJudge
+- from trl.experimental.judges import PairRMJudge
 + from transformers import AutoModelForSequenceClassification

docs/source/xpo_trainer.md (3 additions & 3 deletions)
@@ -17,7 +17,7 @@ This post-training method was contributed by [Kashif Rasul](https://huggingface.

 ## Quick start

-This example demonstrates how to train a model using the XPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
+This example demonstrates how to train a model using the XPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`experimental.judges.PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
@@ -30,7 +30,7 @@ Below is the script to train the model:
 ```python
 # train_xpo.py
 from datasets import load_dataset
-from trl import PairRMJudge
+from trl.experimental.judges import PairRMJudge
 from trl.experimental.xpo import XPOConfig, XPOTrainer
 from transformers import AutoModelForCausalLM, AutoTokenizer

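Similarly for XPO, a sketch of how this quick-start script typically continues after the import change; as with the Nash-MD sketch above, the dataset id, output directory, and the exact `XPOTrainer` keyword arguments are assumptions for illustration:

```python
# Sketch only: continues the XPO quick-start script above with assumed settings.
from datasets import load_dataset
from trl.experimental.judges import PairRMJudge
from trl.experimental.xpo import XPOConfig, XPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()  # assumed: typically requires the optional llm-blender package
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")  # assumed prompt-only dataset

training_args = XPOConfig(output_dir="Qwen2-0.5B-XPO")  # assumed output directory
trainer = XPOTrainer(
    model=model,
    judge=judge,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```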
@@ -75,7 +75,7 @@ XPO requires a [prompt-only dataset](dataset_formats#prompt-only). The [`experim
 Instead of a judge, you can choose to use a reward model -- see [Reward Bench](https://huggingface.co/spaces/allenai/reward-bench) for a leaderboard of public models you can use. Below is a code example showing how to replace a judge with the [trl-lib/Qwen2-0.5B-Reward](https://huggingface.co/trl-lib/Qwen2-0.5B-Reward) model:

 ```diff
-- from trl import PairRMJudge
+- from trl.experimental.judges import PairRMJudge
 + from transformers import AutoModelForSequenceClassification