
Commit 6a2e980

Squashed commit of the following:

- commit 4677cf2: Removed Sentiment Tuning Examples (#4424). Author: Harras Mansoor <98635627+Harras3@users.noreply.github.com>, Wed Nov 5 04:06:13 2025 +0500
- commit 7a9592b: 🐍 Drop Python 3.9 (#4183). Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>, Tue Nov 4 14:32:04 2025 -0700
- commit 7f15a7f: Removed outdated warning about batch contamination (#4423). Author: Harras Mansoor <98635627+Harras3@users.noreply.github.com>, Wed Nov 5 02:06:31 2025 +0500
- commit 8b0a3ce: Update tokenizer apply_chat_template with return_dict=True default (#4448). Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>, Tue Nov 4 21:37:39 2025 +0100
- commit d9f9e2b: Support casting to fp32 when word embeddings are tied to lm_head (#4446). Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>, Tue Nov 4 19:56:58 2025 +0000
- commit 4e138ab: Upload notebook with T4 selected (#4449). Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>, Tue Nov 4 15:15:23 2025 +0100
- commit 43253b2: Add On-Policy Distillation from thinking labs to paper index. (#4410). Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>, Mon Nov 3 21:07:31 2025 +0000. Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
- commit 6f41b18: fix: Remove chat template setting from non-SFT trainer scripts (#4437). Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>, Mon Nov 3 10:57:51 2025 -0800. Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>, Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
1 parent 9385f50 commit 6a2e980

112 files changed: +2113 -2004 lines changed


.github/workflows/tests.yml

Lines changed: 1 addition & 1 deletion

@@ -40,7 +40,7 @@ jobs:
     name: Tests
     strategy:
       matrix:
-        python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
+        python-version: ['3.10', '3.11', '3.12', '3.13']
       fail-fast: false
     runs-on:
       group: aws-g4dn-2xlarge

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 repos:
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.11.10
+    rev: v0.13.3
     hooks:
       - id: ruff-check
         types_or: [ python, pyi ]

CONTRIBUTING.md

Lines changed: 0 additions & 18 deletions

@@ -285,24 +285,6 @@ def replicate_str(string: str, n: int, sep: str = " ") -> str:
 * **Definite Articles:** Removed definite articles where possible to streamline language. (Eg: Changed "The string to replicate" to "String to replicate")
 * **Type Annotations:**
   * Always include type definitions, indicating if a parameter is optional and specifying the default value.
-  * Note that `Optional` means that the value can be `None`, and `*optional*` means that it is not required for the user to pass a value.
-    E.g., for arguments that can't be `None` and aren't required:
-
-    ```txt
-    foo (`int`, *optional*, defaults to `4`):
-    ```
-
-    For arguments that can be `None` and are required:
-
-    ```txt
-    foo (`Optional[int]`):
-    ```
-
-    for arguments that can be `None` and aren't required (in this case, if the default value is `None`, you can omit it):
-
-    ```txt
-    foo (`Optional[int]`, *optional*):
-    ```
 
 * **String Defaults:**
   * Ensured that default string values are wrapped in double quotes:

docs/source/_toctree.yml

Lines changed: 0 additions & 2 deletions

@@ -53,8 +53,6 @@
     title: Community Tutorials
   - local: lora_without_regret
     title: LoRA Without Regret
-  - local: sentiment_tuning
-    title: Sentiment Tuning
   - local: multi_adapter_rl
     title: Multi Adapter RLHF
   title: Examples

docs/source/lora_without_regret.md

Lines changed: 1 addition & 1 deletion

@@ -141,7 +141,7 @@ For reinforcement learning, the blog uses a math reasoning task that we can repr
 ```python
 def strip_reasoning_accuracy_reward(
     completions: list[list[dict[str, str]]], solution: list[str], **kwargs
-) -> list[Optional[float]]:
+) -> list[float | None]:
     """Reward function that strips reasoning tags and checks mathematical accuracy.
 
     This function:
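
This return-type change goes with the "🐍 Drop Python 3.9" commit above: the `X | None` union syntax from PEP 604 is only usable in runtime annotations from Python 3.10 onward, so it can replace `typing.Optional` once 3.9 support is dropped. A minimal, self-contained sketch of the equivalence (the function below is illustrative and not part of the diff):

```python
# Requires Python >= 3.10: PEP 604 lets unions be written with `|`,
# so `float | None` replaces the older `Optional[float]` spelling.
def first_score(scores: list[float | None]) -> float | None:
    """Return the first non-None score, or None if every entry is missing."""
    for score in scores:
        if score is not None:
            return score
    return None


print(first_score([None, 0.5, 1.0]))  # 0.5
print(first_score([None, None]))      # None
```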

docs/source/paper_index.md

Lines changed: 44 additions & 0 deletions

@@ -605,3 +605,47 @@ def add_margin(example):
 
 dataset = dataset.map(add_margin)
 ```
+
+## Distillation
+Papers relating to training a student model with the help of a teacher model.
+
+### On-Policy Distillation
+**📰 Blog**: https://thinkingmachines.ai/blog/on-policy-distillation/
+
+On-Policy Distillation involves a student model generating rollouts for each batch of training data. We subsequently obtain the probability distributions for each token of the rollouts from both the student and teacher models. The student model is then optimized to minimize the negative Kullback-Leibler (KL) divergence between its own token distributions and those of the teacher model.
+
+| Method                 | Sampling   | Reward signal |
+|------------------------|------------|---------------|
+| Supervised finetuning  | off-policy | dense         |
+| Reinforcement learning | on-policy  | sparse        |
+| On-policy distillation | on-policy  | dense         |
+
+On-Policy Distillation has been shown to outperform SFT, GRPO and can be used to restore generalization capabilities lost during SFT.
+
+Additionally on-policy distillation is more compute efficient and is less prone to overfitting when trained with limited data.
+
+To train a model with on-policy distillation using TRL, you can use the following configuration, with the [`GKDTrainer`] and [`GKDConfig`]:
+
+```python
+from trl import GKDConfig
+
+config = GKDConfig(
+    lmbda=1.0,  # student produces rollouts for all batches
+    beta=1.0,  # to ensure reverse-kl as the loss function
+    teacher_model_name_or_path="teacher-model",  # specify the teacher model
+)
+```
+
+Alternatively, you can use the [`GOLDTrainer`] and [`GOLDConfig`] to perform on-policy distillation with a similar configuration:
+
+```python
+from trl.experimental import GOLDConfig
+
+config = GOLDConfig(
+    lmbda=1.0,  # student produces rollouts for all batches
+    beta=1.0,  # to ensure reverse-kl as the loss function
+    teacher_model_name_or_path="teacher-model",  # specify the teacher model
+)
+```
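
The added documentation describes the objective in prose and through the `beta=1.0` ("reverse-kl") setting; the sketch below spells out that per-token reverse KL between the student and teacher distributions on the student's own rollouts. This is a minimal illustration of the idea only, not the `GKDTrainer` implementation; the tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F


def per_token_reverse_kl(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Reverse KL, D_KL(student || teacher), at each token position.

    Both inputs are assumed to have shape (batch, seq_len, vocab_size) and to be
    computed on the student's own rollouts (the on-policy part).
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_log_probs = F.log_softmax(teacher_logits, dim=-1)
    # Sum over the vocabulary: p_student * (log p_student - log p_teacher)
    return (student_log_probs.exp() * (student_log_probs - teacher_log_probs)).sum(dim=-1)


# Toy usage with random logits; a real setup would take logits from the two models.
student_logits = torch.randn(2, 5, 32, requires_grad=True)
teacher_logits = torch.randn(2, 5, 32)
loss = per_token_reverse_kl(student_logits, teacher_logits).mean()  # dense, per-token signal
loss.backward()
```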

docs/source/reducing_memory_usage.md

Lines changed: 0 additions & 3 deletions

@@ -90,9 +90,6 @@ from trl import SFTConfig
 training_args = SFTConfig(..., packing=True, max_length=512)
 ```
 
-> [!WARNING]
-> Packing may cause batch contamination, where adjacent sequences influence one another. This can be problematic for some applications. For more details, see [#1230](https://github.com/huggingface/trl/issues/1230).
-
 ## Liger for reducing peak memory usage
 
 > [Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%.

docs/source/sentiment_tuning.md

Lines changed: 0 additions & 29 deletions
This file was deleted.

examples/datasets/hh-rlhf-helpful-base.py

Lines changed: 2 additions & 3 deletions

@@ -14,7 +14,6 @@
 
 import re
 from dataclasses import dataclass, field
-from typing import Optional
 
 from datasets import load_dataset
 from huggingface_hub import ModelCard

@@ -42,15 +41,15 @@ class ScriptArguments:
     repo_id: str = field(
         default="trl-lib/hh-rlhf-helpful-base", metadata={"help": "Hugging Face repository ID to push the dataset to."}
     )
-    dataset_num_proc: Optional[int] = field(
+    dataset_num_proc: int | None = field(
         default=None, metadata={"help": "Number of workers to use for dataset processing."}
     )
 
 
 def common_start(str1: str, str2: str) -> str:
     # Zip the two strings and iterate over them together
     common_chars = []
-    for c1, c2 in zip(str1, str2):
+    for c1, c2 in zip(str1, str2, strict=True):
         if c1 == c2:
             common_chars.append(c1)
         else:
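
`strict=True` is the Python 3.10+ `zip` flag, another change enabled by dropping Python 3.9: instead of silently truncating to the shorter input, `zip` raises `ValueError` when its arguments have different lengths. A small standalone illustration, separate from the script above:

```python
# zip(..., strict=True) is available from Python 3.10 onward.
pairs = list(zip("abc", "abd", strict=True))
print(pairs)  # [('a', 'a'), ('b', 'b'), ('c', 'd')]

try:
    list(zip("abc", "abcdef", strict=True))
except ValueError as err:
    # Mismatched lengths now fail loudly instead of being silently truncated.
    print(err)
```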

examples/datasets/llava_instruct_mix.py

Lines changed: 1 addition & 2 deletions

@@ -14,7 +14,6 @@
 
 import ast
 from dataclasses import dataclass, field
-from typing import Optional
 
 from datasets import load_dataset
 from huggingface_hub import ModelCard

@@ -43,7 +42,7 @@ class ScriptArguments:
         default="trl-lib/llava-instruct-mix",
         metadata={"help": "Hugging Face repository ID to push the dataset to."},
     )
-    dataset_num_proc: Optional[int] = field(
+    dataset_num_proc: int | None = field(
         default=None,
         metadata={"help": "Number of workers to use for dataset processing."},
     )
