Skip to content

Commit 64c599a

Browse files
committed
Squashed commit of the following:
commit 07b4a84 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Dec 1 12:55:24 2025 -0700 Silence experimental warnings when imported in the stable (#4606) commit c55ef4b Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Dec 1 12:40:42 2025 -0700 Update How-to guides (#4604) commit c686d7d Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Mon Dec 1 20:34:31 2025 +0100 Raise FutureWarning for classes moved to experimental (#4605) commit c7d172b Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Mon Dec 1 01:47:22 2025 -0800 docs: Expand speeding up training guide with acceleration methods (#4428) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit f1dfef0 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Mon Dec 1 01:39:08 2025 -0800 docs: Expand training customization examples (#4427) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit eb76389 Author: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com> Date: Sun Nov 30 16:45:21 2025 +0100 [GRPO] Sequence-level TIS & MIS (#4530) commit 0726977 Author: xuanduy04 <65279552+xuanduy04@users.noreply.github.com> Date: Fri Nov 28 23:56:22 2025 +0700 docs: Add Beyond the 80/20 Rule (2506.01939) to Paper Index (#4580) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 9731d08 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Nov 28 17:43:38 2025 +0100 Revert "Hotfix CI with dev dependencies: xfail test_prepare_inputs_for_generation" (#4587) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 84a0bbc Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Nov 28 16:13:56 2025 +0100 Fix 'generation_config' AttributeError (#4596) commit f67c3f2 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Fri Nov 28 15:46:02 2025 +0100 Remove module-level imports of extra deps in experimental.judges (#4598) commit cb5fdf9 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Nov 27 11:08:26 2025 +0100 Add missing require_bitsandbytes marker to CI tests (#4586) commit 4a3b584 Author: juejuezi <juejuezi.git@foxmail.com> Date: Thu Nov 27 00:11:56 2025 +0800 fix: use shift_labels for metrics when using CP or SP (#4579) Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit d2e4315 Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Wed Nov 26 15:40:15 2025 +0100 Revert hotfix Fall back to config.text_config._name_or_path (#4581) commit 357e331 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 26 04:55:46 2025 -0700 Move tests for GSPOTokenTrainer to experimental (#4572) commit a59f2cf Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 26 04:50:44 2025 -0700 Move `WinRateCallback` to experimental (#4558) Co-authored-by: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit cf431db Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 26 04:11:04 2025 -0700 Fix PPO example (#4556) commit cac9f1d Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Tue Nov 25 21:27:58 2025 +0000 Fix Replay Buffer docs. (#4574) commit 547d924 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Nov 25 09:34:22 2025 -0700 Add `shuffle_dataset` option to `SFTTrainer` (#4564) commit b01f8ca Author: iliasmerigh <91261122+iliasmerigh@users.noreply.github.com> Date: Tue Nov 25 17:33:14 2025 +0100 Fix typo in GRPO description in README (#4573) commit 7856d3b Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Tue Nov 25 09:32:39 2025 -0700 Fix vLLM sleep mode: add collective RPC call to reload weights in vLLM wake-up process (#4571) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> commit 64d089e Author: lewtun <lewis.c.tunstall@gmail.com> Date: Tue Nov 25 14:39:40 2025 +0100 Reasoning reward (#4563) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 3b7d0e4 Author: Quentin Gallouédec <gallouedec.quentin@gmail.com> Date: Tue Nov 25 04:48:06 2025 +0000 Remove Online DPO from stable trainers section in documentation commit 6f3a452 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Nov 24 08:11:49 2025 -0700 Reorder documentation TOC to surface key trainer sections (#4565) commit 46af266 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Mon Nov 24 02:39:25 2025 -0800 docs: Rewrite PEFT integration guide with comprehensive examples (#4421) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit db4f6e5 Author: mingxuetian <108911581+mingxuetian@users.noreply.github.com> Date: Mon Nov 24 09:51:42 2025 +0800 Add `num_generations_eval` parameter for efficient evaluation (#4458) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 07f3c95 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Sun Nov 23 17:33:36 2025 -0800 Move OnlineDPOTrainer to experimental module (#4473) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 4cb1a25 Author: Kashif Rasul <kashif.rasul@gmail.com> Date: Sat Nov 22 23:31:29 2025 +0100 [SFT] Log mean token accuracy from Liger kernel (#4302) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 468b9d4 Author: Susant <acharysusant@gmail.com> Date: Sun Nov 23 03:40:32 2025 +0530 docs: add KTO (2402.01306) to Paper Index + link ref to KTOTrainer (#4440) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 9bc6206 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Fri Nov 21 17:34:50 2025 -0800 Move PRMTrainer to trl.experimental.prm (#4483) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit f7ac974 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Fri Nov 21 16:01:04 2025 +0100 Update OpenEnv guide with new notebook (#4555) commit c0de042 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Fri Nov 21 15:40:25 2025 +0100 Add GRPO Wordle OpenEnv Colab (#4542) commit 9f8ef40 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 20 22:36:31 2025 -0800 [ORPO] Move ORPOTrainer to experimental (#4480) commit 3bb5d76 Author: Jen Wei <45276133+JenWei0312@users.noreply.github.com> Date: Thu Nov 20 18:53:10 2025 -0700 fix+docs: `device_map=None` for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index (#4551) commit 375b3eb Author: Jonny Li <jonny_li@live.ca> Date: Thu Nov 20 19:42:45 2025 -0500 Add target_parameters to LoraConfig (#4536) commit 237900d Author: Kristian Schwethelm <47533587+kschwethelm@users.noreply.github.com> Date: Thu Nov 20 23:03:20 2025 +0100 Fix bug with VLM processors in prompt-completion completion text-only training (#4553) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 52ed4df Author: Quentin Gallouédec <gallouedec.quentin@gmail.com> Date: Thu Nov 20 21:41:23 2025 +0000 Fix style OpenEnv example commit a263946 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Nov 20 14:44:15 2025 +0100 Update OpenEnv guide with latest details (#4552) Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com> commit 1a9ff52 Author: Kashif Rasul <kashif.rasul@gmail.com> Date: Wed Nov 19 15:34:25 2025 +0100 [OpenEnv] browsergym example script (#4539) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 6cbcd94 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Wed Nov 19 14:39:44 2025 +0100 Update OpenEnv example scripts (#4547) commit 8510589 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Wed Nov 19 14:39:20 2025 +0100 Add OpenEnv Script examples to docs (#4533) commit e622196 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Mon Nov 17 03:12:30 2025 -0700 [Doc] Drop dummy reward and dataset for DeepMath-103K and accuracy reward (#4524) commit 1b1242c Author: Kashif Rasul <kashif.rasul@gmail.com> Date: Fri Nov 14 20:51:41 2025 +0100 [OpenEnv] add vllm colocate mode to openenv scripts (#4510) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit f39d18a Author: Fabio Milentiansen Sim <sim.fabio.fms@gmail.com> Date: Fri Nov 14 23:39:02 2025 +0700 fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type (#4526) commit d45eaab Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Fri Nov 14 12:12:09 2025 +0100 Add vLLM quantization option for colocate (#4496) Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> commit a91d4b3 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Fri Nov 14 02:19:08 2025 +0100 Prevent upcasting norm layers in `prepare_model_for_kbit_training` (#4457) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 121318e Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 13 17:13:16 2025 -0800 docs: Extend CLI basic usage examples to all supported CLIs (#4425) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 7918320 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Nov 13 13:20:52 2025 -0700 Remove test trainer args (#4517) commit 102dc41 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Nov 13 12:36:43 2025 -0700 Rename `flash-attn` to `flash-attn2` (#4514) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 5de62b0 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Thu Nov 13 12:05:48 2025 -0700 Add step time metric to GRPO Trainer for performance tracking (#4516) Co-authored-by: lewtun <lewis.c.tunstall@gmail.com> commit f1e6377 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 13 11:01:19 2025 -0800 Move PPOTrainer to trl.experimental.ppo (#4482) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 01f497e Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 13 10:14:58 2025 -0800 Move NashMDTrainer to experimental module (#4477) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit b6c838a Author: Quentin Gallouédec <gallouedec.quentin@gmail.com> Date: Thu Nov 13 16:53:26 2025 +0000 `aws-general-8-plus` runner for Docker build commit ed5c7bb Author: YangKai0616 <kai.yang@intel.com> Date: Fri Nov 14 00:42:48 2025 +0800 [Bug Fix] OnlineDPOTrainer with vLLM Server Mode (#4500) commit ded9bc6 Author: lewtun <lewis.c.tunstall@gmail.com> Date: Thu Nov 13 17:33:59 2025 +0100 Fix Docker images for Liger (#4522) commit fd04760 Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Thu Nov 13 11:31:10 2025 +0000 Paper Index: Change `num_completions` to `num_generations` (#4515) commit b7918c0 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Nov 12 20:35:44 2025 -0800 Move GKDTrainer to experimental module (#4474) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 07b5011 Author: Tamoghno Kandar <55907205+tamoghnokandar@users.noreply.github.com> Date: Wed Nov 12 20:07:33 2025 -0800 Replace flash attention2 with kernels-community/flash-attn2 (#4426) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit 7a57fd4 Author: Yuxian Gu <guyx21@mails.tsinghua.edu.cn> Date: Thu Nov 13 11:16:20 2025 +0800 MiniLLM: Fix arguments in config & add to documentation index (#4518) commit a145eaf Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Wed Nov 12 16:35:46 2025 -0800 refactor: Move CPOTrainer to experimental module (#4470) commit d2dc717 Author: Taha Yassine <40228615+taha-yassine@users.noreply.github.com> Date: Thu Nov 13 00:56:47 2025 +0100 Replace `wandb_log_unique_prompts` with `log_unique_prompts` (#4508) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 799b39b Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 12 16:21:05 2025 -0700 `device_map` and `dtype` to `"auto"` by default (#4509) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit a6a2beb Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Wed Nov 12 09:42:31 2025 -0700 Add temporary workaround for `lr_scheduler_kwargs` dtype issue in Transformers 4.57.0 (#4513) commit 346701a Author: lewtun <lewis.c.tunstall@gmail.com> Date: Wed Nov 12 17:42:18 2025 +0100 Replace accelerate logging with stdlib in CLI (#4512) commit 4db63af Author: Quentin Gallouédec <gallouedec.quentin@gmail.com> Date: Wed Nov 12 02:19:51 2025 +0000 Fix GRPO unsqueeze advantages commit ecb2811 Author: Yuxian Gu <guyx21@mails.tsinghua.edu.cn> Date: Wed Nov 12 10:17:22 2025 +0800 Add MiniLLM Trainer (#4504) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 89e4688 Author: Taha Yassine <40228615+taha-yassine@users.noreply.github.com> Date: Tue Nov 11 20:36:23 2025 +0100 Add support for images inside tables with Trackio completions logging (#4505) commit 2d3279c Author: lewtun <lewis.c.tunstall@gmail.com> Date: Tue Nov 11 19:22:25 2025 +0100 Tweak description for vLLM sleep mode (#4506) Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 02a3477 Author: Luke Hinds <lukehinds@gmail.com> Date: Mon Nov 10 16:41:51 2025 +0000 Fix link to OpenEnv docs (#4502) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> commit aaed6c1 Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Date: Sat Nov 8 08:20:48 2025 -0700 Consistency regarding relative imports (#4498) commit 20760ba Author: burtenshaw <ben.burtenshaw@gmail.com> Date: Fri Nov 7 10:50:50 2025 +0100 [DOCS] update and fix openenv (#4490) Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com> Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 64cfca4 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 6 22:47:04 2025 -0800 Move judges to experimental submodule (#4439) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 97ca1a2 Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Fri Nov 7 00:20:15 2025 +0000 Fix bugs in CISPO conditions (#4499) commit ffb3dd5 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 6 16:03:00 2025 -0800 docs: Add PEFT subsection to reducing memory usage guide (#4430) Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> commit 43b6541 Author: SolarWindRider <31797478+SolarWindRider@users.noreply.github.com> Date: Fri Nov 7 06:55:34 2025 +0800 Support completion bootstrap for VLM in GRPO/RLOO (#4452) Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 642b721 Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Thu Nov 6 22:33:00 2025 +0000 ScaleRL: Add CISPO Loss (#4495) commit 32e9c9f Author: Ishita Bhattacharyya <139248026+ishitab02@users.noreply.github.com> Date: Fri Nov 7 03:37:43 2025 +0530 ⛴️ Add kernels to Docker images (#4445) Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 1bcfc50 Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com> Date: Thu Nov 6 13:40:12 2025 -0800 Move XPOTrainer to trl.experimental.xpo (#4485) Co-authored-by: Invidia19 <54266187+Invidia19@users.noreply.github.com> Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com> commit 37942bc Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com> Date: Thu Nov 6 21:32:03 2025 +0000 Buffer samples based on group level stds. (#4492) commit 66cd02a Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com> Date: Thu Nov 6 20:58:25 2025 +0100 Add tiny model Qwen3VLForConditionalGeneration to CI (#4494) commit 32febb4 Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com> Date: Thu Nov 6 18:21:56 2025 +0100 Add LFM2 to SFT notebook examples (#4455)
1 parent cd5456e commit 64c599a

File tree

167 files changed

+15506
-9883
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

167 files changed

+15506
-9883
lines changed

.github/workflows/docker-build.yml

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,8 @@ concurrency:
1313
jobs:
1414
trl:
1515
name: "Build and push TRL Docker image"
16-
runs-on: ubuntu-latest
16+
runs-on:
17+
group: aws-general-8-plus
1718
steps:
1819
- name: Checkout code
1920
uses: actions/checkout@v4
@@ -52,7 +53,8 @@ jobs:
5253

5354
trl-dev:
5455
name: "Build and push TRL Dev Docker image"
55-
runs-on: ubuntu-latest
56+
runs-on:
57+
group: aws-general-8-plus
5658
steps:
5759
- name: Checkout code
5860
uses: actions/checkout@v4

README.md

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -21,11 +21,11 @@
2121

2222
**OpenEnv Integration:** TRL now supports **[OpenEnv](https://huggingface.co/blog/openenv)**, the open-source framework from Meta for defining, deploying, and interacting with environments in reinforcement learning and agentic workflows.
2323

24-
Explore how to seamlessly integrate TRL with OpenEnv in our [dedicated documentation](openenv).
24+
Explore how to seamlessly integrate TRL with OpenEnv in our [dedicated documentation](https://huggingface.co/docs/trl/openenv).
2525

2626
## Overview
2727

28-
TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Built on top of the [🤗 Transformers](https://github.com/huggingface/transformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.
28+
TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Built on top of the [🤗 Transformers](https://github.com/huggingface/transformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.
2929

3030
## Highlights
3131

@@ -92,21 +92,21 @@ trainer.train()
9292
```python
9393
from datasets import load_dataset
9494
from trl import GRPOTrainer
95+
from trl.rewards import accuracy_reward
9596

96-
dataset = load_dataset("trl-lib/tldr", split="train")
97-
98-
# Dummy reward function: count the number of unique characters in the completions
99-
def reward_num_unique_chars(completions, **kwargs):
100-
return [len(set(c)) for c in completions]
97+
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
10198

10299
trainer = GRPOTrainer(
103100
model="Qwen/Qwen2-0.5B-Instruct",
104-
reward_funcs=reward_num_unique_chars,
101+
reward_funcs=accuracy_reward,
105102
train_dataset=dataset,
106103
)
107104
trainer.train()
108105
```
109106

107+
> [NOTE!]
108+
> For reasoning models, use the `reasoning_accuracy_reward()` function for better results.
109+
110110
### `DPOTrainer`
111111

112112
[`DPOTrainer`](https://huggingface.co/docs/trl/dpo_trainer) implements the popular [Direct Preference Optimization (DPO) algorithm](https://huggingface.co/papers/2305.18290) that was used to post-train [Llama 3](https://huggingface.co/papers/2407.21783) and many other models. Here is a basic example of how to use the `DPOTrainer`:

docker/trl-dev/Dockerfile

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,5 @@
1-
FROM pytorch/pytorch:2.8.0-cuda12.8-cudnn9-runtime
1+
FROM pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
22
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
33
RUN pip install --upgrade pip uv
44
RUN uv pip install --system --no-cache "git+https://github.com/huggingface/trl.git#egg=trl[liger,peft,vlm]"
5-
RUN uv pip install --system hf_transfer liger_kernel trackio peft
6-
RUN uv pip install --system https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
5+
RUN uv pip install --system kernels liger_kernel peft trackio

docker/trl/Dockerfile

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
FROM pytorch/pytorch:2.8.0-cuda12.8-cudnn9-runtime
1+
FROM pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
2+
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
23
RUN pip install --upgrade pip uv
3-
RUN uv pip install --system trl[liger,peft,vlm] hf_transfer trackio
4-
RUN uv pip install --system https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
4+
RUN uv pip install --system trl[liger,peft,vlm] kernels trackio

docs/source/_toctree.yml

Lines changed: 40 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,20 @@
1212
- local: paper_index
1313
title: Paper Index
1414
title: Conceptual Guides
15+
- sections: # Sorted alphabetically
16+
- local: dpo_trainer
17+
title: DPO
18+
- local: grpo_trainer
19+
title: GRPO
20+
- local: kto_trainer
21+
title: KTO
22+
- local: reward_trainer
23+
title: Reward
24+
- local: rloo_trainer
25+
title: RLOO
26+
- local: sft_trainer
27+
title: SFT
28+
title: Trainers
1529
- sections:
1630
- local: clis
1731
title: Command Line Interface (CLI)
@@ -55,42 +69,10 @@
5569
title: LoRA Without Regret
5670
title: Examples
5771
- sections:
58-
- sections: # Sorted alphabetically
59-
- local: cpo_trainer
60-
title: CPO
61-
- local: dpo_trainer
62-
title: DPO
63-
- local: online_dpo_trainer
64-
title: Online DPO
65-
- local: gkd_trainer
66-
title: GKD
67-
- local: grpo_trainer
68-
title: GRPO
69-
- local: kto_trainer
70-
title: KTO
71-
- local: nash_md_trainer
72-
title: Nash-MD
73-
- local: orpo_trainer
74-
title: ORPO
75-
- local: ppo_trainer
76-
title: PPO
77-
- local: prm_trainer
78-
title: PRM
79-
- local: reward_trainer
80-
title: Reward
81-
- local: rloo_trainer
82-
title: RLOO
83-
- local: sft_trainer
84-
title: SFT
85-
- local: xpo_trainer
86-
title: XPO
87-
title: Trainers
8872
- local: models
8973
title: Model Classes
9074
- local: model_utils
9175
title: Model Utilities
92-
- local: judges
93-
title: Judges
9476
- local: callbacks
9577
title: Callbacks
9678
- local: data_utils
@@ -105,12 +87,18 @@
10587
- sections:
10688
- local: experimental_overview
10789
title: Experimental Overview
90+
- local: openenv
91+
title: OpenEnv Integration
10892
- local: bema_for_reference_model # Sorted alphabetically
10993
title: BEMA for Reference Model
11094
- local: bco_trainer
11195
title: BCO
96+
- local: cpo_trainer
97+
title: CPO
11298
- local: gfpo
11399
title: GFPO
100+
- local: gkd_trainer
101+
title: GKD
114102
- local: gold_trainer
115103
title: GOLD
116104
- local: grpo_with_replay_buffer
@@ -123,4 +111,24 @@
123111
title: OpenEnv Integration
124112
- local: papo_trainer
125113
title: PAPO
114+
- local: judges
115+
title: Judges
116+
- local: minillm_trainer
117+
title: MiniLLM
118+
- local: nash_md_trainer
119+
title: Nash-MD
120+
- local: online_dpo_trainer
121+
title: Online DPO
122+
- local: orpo_trainer
123+
title: ORPO
124+
- local: papo_trainer
125+
title: PAPO
126+
- local: ppo_trainer
127+
title: PPO
128+
- local: prm_trainer
129+
title: PRM
130+
- local: winrate_callback
131+
title: WinRateCallback
132+
- local: xpo_trainer
133+
title: XPO
126134
title: Experimental

docs/source/callbacks.md

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -8,10 +8,6 @@
88

99
[[autodoc]] RichProgressCallback
1010

11-
## WinRateCallback
12-
13-
[[autodoc]] WinRateCallback
14-
1511
## LogCompletionsCallback
1612

1713
[[autodoc]] LogCompletionsCallback

0 commit comments

Comments
 (0)