
Commit 6a2e980

Squashed commit of the following:

- commit 4677cf2: Removed Sentiment Tuning Examples (#4424). Author: Harras Mansoor <98635627+Harras3@users.noreply.github.com>, Wed Nov 5 04:06:13 2025 +0500
- commit 7a9592b: 🐍 Drop Python 3.9 (#4183). Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>, Tue Nov 4 14:32:04 2025 -0700
- commit 7f15a7f: Removed outdated warning about batch contamination (#4423). Author: Harras Mansoor <98635627+Harras3@users.noreply.github.com>, Wed Nov 5 02:06:31 2025 +0500
- commit 8b0a3ce: Update tokenizer apply_chat_template with return_dict=True default (#4448). Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>, Tue Nov 4 21:37:39 2025 +0100
- commit d9f9e2b: Support casting to fp32 when word embeddings are tied to lm_head (#4446). Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>, Tue Nov 4 19:56:58 2025 +0000
- commit 4e138ab: Upload notebook with T4 selected (#4449). Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>, Tue Nov 4 15:15:23 2025 +0100
- commit 43253b2: Add On-Policy Distillation from thinking labs to paper index. (#4410). Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>, Mon Nov 3 21:07:31 2025 +0000. Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
- commit 6f41b18: fix: Remove chat template setting from non-SFT trainer scripts (#4437). Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>, Mon Nov 3 10:57:51 2025 -0800. Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>, Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
1 parent 9385f50 commit 6a2e980

112 files changed: +2113 -2004 lines changed


.github/workflows/tests.yml

Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ jobs:
     name: Tests
     strategy:
       matrix:
-        python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
+        python-version: ['3.10', '3.11', '3.12', '3.13']
       fail-fast: false
     runs-on:
       group: aws-g4dn-2xlarge

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.11.10
+    rev: v0.13.3
     hooks:
       - id: ruff-check
         types_or: [ python, pyi ]

CONTRIBUTING.md

Lines changed: 0 additions & 18 deletions

@@ -285,24 +285,6 @@ def replicate_str(string: str, n: int, sep: str = " ") -> str:
 * **Definite Articles:** Removed definite articles where possible to streamline language. (Eg: Changed "The string to replicate" to "String to replicate")
 * **Type Annotations:**
   * Always include type definitions, indicating if a parameter is optional and specifying the default value.
-  * Note that `Optional` means that the value can be `None`, and `*optional*` means that it is not required for the user to pass a value.
-    E.g., for arguments that can't be `None` and aren't required:
-
-    ```txt
-    foo (`int`, *optional*, defaults to `4`):
-    ```
-
-    For arguments that can be `None` and are required:
-
-    ```txt
-    foo (`Optional[int]`):
-    ```
-
-    for arguments that can be `None` and aren't required (in this case, if the default value is `None`, you can omit it):
-
-    ```txt
-    foo (`Optional[int]`, *optional*):
-    ```
 
 * **String Defaults:**
   * Ensured that default string values are wrapped in double quotes:

docs/source/_toctree.yml

Lines changed: 0 additions & 2 deletions

@@ -53,8 +53,6 @@
     title: Community Tutorials
   - local: lora_without_regret
     title: LoRA Without Regret
-  - local: sentiment_tuning
-    title: Sentiment Tuning
   - local: multi_adapter_rl
     title: Multi Adapter RLHF
   title: Examples

docs/source/lora_without_regret.md

Lines changed: 1 addition & 1 deletion

@@ -141,7 +141,7 @@ For reinforcement learning, the blog uses a math reasoning task that we can repr
 ```python
 def strip_reasoning_accuracy_reward(
     completions: list[list[dict[str, str]]], solution: list[str], **kwargs
-) -> list[Optional[float]]:
+) -> list[float | None]:
     """Reward function that strips reasoning tags and checks mathematical accuracy.
 
     This function:
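
This return-type change goes with the "🐍 Drop Python 3.9" commit above: the `X | None` union syntax from PEP 604 is only usable in runtime annotations from Python 3.10 onward, so it can replace `typing.Optional` once 3.9 support is dropped. A minimal, self-contained sketch of the equivalence (the function below is illustrative and not part of the diff):

```python
# Requires Python >= 3.10: PEP 604 lets unions be written with `|`,
# so `float | None` replaces the older `Optional[float]` spelling.
def first_score(scores: list[float | None]) -> float | None:
    """Return the first non-None score, or None if every entry is missing."""
    for score in scores:
        if score is not None:
            return score
    return None


print(first_score([None, 0.5, 1.0]))  # 0.5
print(first_score([None, None]))      # None
```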

docs/source/paper_index.md

Lines changed: 44 additions & 0 deletions

@@ -605,3 +605,47 @@ def add_margin(example):
 
 dataset = dataset.map(add_margin)
 ```
+
+## Distillation
+Papers relating to training a student model with the help of a teacher model.
+
+### On-Policy Distillation
+**📰 Blog**: https://thinkingmachines.ai/blog/on-policy-distillation/
+
+On-Policy Distillation involves a student model generating rollouts for each batch of training data. We subsequently obtain the probability distributions for each token of the rollouts from both the student and teacher models. The student model is then optimized to minimize the negative Kullback-Leibler (KL) divergence between its own token distributions and those of the teacher model.
+
+| Method                 | Sampling   | Reward signal |
+|------------------------|------------|---------------|
+| Supervised finetuning  | off-policy | dense         |
+| Reinforcement learning | on-policy  | sparse        |
+| On-policy distillation | on-policy  | dense         |
+
+On-Policy Distillation has been shown to outperform SFT, GRPO and can be used to restore generalization capabilities lost during SFT.
+
+Additionally on-policy distillation is more compute efficient and is less prone to overfitting when trained with limited data.
+
+To train a model with on-policy distillation using TRL, you can use the following configuration, with the [`GKDTrainer`] and [`GKDConfig`]:
+
+```python
+from trl import GKDConfig
+
+config = GKDConfig(
+    lmbda=1.0,  # student produces rollouts for all batches
+    beta=1.0,  # to ensure reverse-kl as the loss function
+    teacher_model_name_or_path="teacher-model",  # specify the teacher model
+)
+```
+
+Alternatively, you can use the [`GOLDTrainer`] and [`GOLDConfig`] to perform on-policy distillation with a similar configuration:
+
+```python
+from trl.experimental import GOLDConfig
+
+config = GOLDConfig(
+    lmbda=1.0,  # student produces rollouts for all batches
+    beta=1.0,  # to ensure reverse-kl as the loss function
+    teacher_model_name_or_path="teacher-model",  # specify the teacher model
+)
+```
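
The added documentation describes the objective in prose and through the `beta=1.0` ("reverse-kl") setting; the sketch below spells out that per-token reverse KL between the student and teacher distributions on the student's own rollouts. This is a minimal illustration of the idea only, not the `GKDTrainer` implementation; the tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F


def per_token_reverse_kl(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Reverse KL, D_KL(student || teacher), at each token position.

    Both inputs are assumed to have shape (batch, seq_len, vocab_size) and to be
    computed on the student's own rollouts (the on-policy part).
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    # Sum over the vocabulary: p_student * (log p_student - log p_teacher)
    return (student_log_probs.exp() * (student_log_probs - teacher_log_probs)).sum(dim=-1)


# Toy usage with random logits; a real setup would take logits from the two models.
student_logits = torch.randn(2, 5, 32, requires_grad=True)
teacher_logits = torch.randn(2, 5, 32)
loss = per_token_reverse_kl(student_logits, teacher_logits).mean()  # dense, per-token signal
loss.backward()
```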

docs/source/reducing_memory_usage.md

Lines changed: 0 additions & 3 deletions

@@ -90,9 +90,6 @@ from trl import SFTConfig
 training_args = SFTConfig(..., packing=True, max_length=512)
 ```
 
-> [!WARNING]
-> Packing may cause batch contamination, where adjacent sequences influence one another. This can be problematic for some applications. For more details, see [#1230](https://github.com/huggingface/trl/issues/1230).
-
 ## Liger for reducing peak memory usage
 
 > [Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%.

docs/source/sentiment_tuning.md

Lines changed: 0 additions & 29 deletions
This file was deleted.

examples/datasets/hh-rlhf-helpful-base.py

Lines changed: 2 additions & 3 deletions

@@ -14,7 +14,6 @@
 
 import re
 from dataclasses import dataclass, field
-from typing import Optional
 
 from datasets import load_dataset
 from huggingface_hub import ModelCard

@@ -42,15 +41,15 @@ class ScriptArguments:
     repo_id: str = field(
         default="trl-lib/hh-rlhf-helpful-base", metadata={"help": "Hugging Face repository ID to push the dataset to."}
     )
-    dataset_num_proc: Optional[int] = field(
+    dataset_num_proc: int | None = field(
         default=None, metadata={"help": "Number of workers to use for dataset processing."}
     )
 
 
 def common_start(str1: str, str2: str) -> str:
     # Zip the two strings and iterate over them together
     common_chars = []
-    for c1, c2 in zip(str1, str2):
+    for c1, c2 in zip(str1, str2, strict=True):
         if c1 == c2:
             common_chars.append(c1)
         else:
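
`strict=True` is the Python 3.10+ `zip` flag, another change enabled by dropping Python 3.9: instead of silently truncating to the shorter input, `zip` raises `ValueError` when its arguments have different lengths. A small standalone illustration, separate from the script above:

```python
# zip(..., strict=True) is available from Python 3.10 onward.
pairs = list(zip("abc", "abd", strict=True))
print(pairs)  # [('a', 'a'), ('b', 'b'), ('c', 'd')]

try:
    list(zip("abc", "abcdef", strict=True))
except ValueError as err:
    # Mismatched lengths now fail loudly instead of being silently truncated.
    print(err)
```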

examples/datasets/llava_instruct_mix.py

Lines changed: 1 addition & 2 deletions

@@ -14,7 +14,6 @@
 
 import ast
 from dataclasses import dataclass, field
-from typing import Optional
 
 from datasets import load_dataset
 from huggingface_hub import ModelCard

@@ -43,7 +42,7 @@ class ScriptArguments:
         default="trl-lib/llava-instruct-mix",
         metadata={"help": "Hugging Face repository ID to push the dataset to."},
     )
-    dataset_num_proc: Optional[int] = field(
+    dataset_num_proc: int | None = field(
         default=None,
         metadata={"help": "Number of workers to use for dataset processing."},
     )
