Commit b6815e3

Squashed commit of the following:
commit 52ed4df
Author: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Date: Thu Nov 20 21:41:23 2025 +0000
Fix style OpenEnv example

commit a263946
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Thu Nov 20 14:44:15 2025 +0100
Update OpenEnv guide with latest details (#4552)
Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com>

commit 1a9ff52
Author: Kashif Rasul <kashif.rasul@gmail.com>
Date: Wed Nov 19 15:34:25 2025 +0100
[OpenEnv] browsergym example script (#4539)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 6cbcd94
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Wed Nov 19 14:39:44 2025 +0100
Update OpenEnv example scripts (#4547)

commit 8510589
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Wed Nov 19 14:39:20 2025 +0100
Add OpenEnv Script examples to docs (#4533)

commit e622196
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Mon Nov 17 03:12:30 2025 -0700
[Doc] Drop dummy reward and dataset for DeepMath-103K and accuracy reward (#4524)

commit 1b1242c
Author: Kashif Rasul <kashif.rasul@gmail.com>
Date: Fri Nov 14 20:51:41 2025 +0100
[OpenEnv] add vllm colocate mode to openenv scripts (#4510)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit f39d18a
Author: Fabio Milentiansen Sim <sim.fabio.fms@gmail.com>
Date: Fri Nov 14 23:39:02 2025 +0700
fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type (#4526)

commit d45eaab
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Fri Nov 14 12:12:09 2025 +0100
Add vLLM quantization option for colocate (#4496)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

commit a91d4b3
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Fri Nov 14 02:19:08 2025 +0100
Prevent upcasting norm layers in `prepare_model_for_kbit_training` (#4457)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 121318e
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Nov 13 17:13:16 2025 -0800
docs: Extend CLI basic usage examples to all supported CLIs (#4425)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 7918320
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Nov 13 13:20:52 2025 -0700
Remove test trainer args (#4517)

commit 102dc41
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Nov 13 12:36:43 2025 -0700
Rename `flash-attn` to `flash-attn2` (#4514)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 5de62b0
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Thu Nov 13 12:05:48 2025 -0700
Add step time metric to GRPO Trainer for performance tracking (#4516)
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

commit f1e6377
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Nov 13 11:01:19 2025 -0800
Move PPOTrainer to trl.experimental.ppo (#4482)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 01f497e
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Nov 13 10:14:58 2025 -0800
Move NashMDTrainer to experimental module (#4477)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit b6c838a
Author: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Date: Thu Nov 13 16:53:26 2025 +0000
`aws-general-8-plus` runner for Docker build

commit ed5c7bb
Author: YangKai0616 <kai.yang@intel.com>
Date: Fri Nov 14 00:42:48 2025 +0800
[Bug Fix] OnlineDPOTrainer with vLLM Server Mode (#4500)

commit ded9bc6
Author: lewtun <lewis.c.tunstall@gmail.com>
Date: Thu Nov 13 17:33:59 2025 +0100
Fix Docker images for Liger (#4522)

commit fd04760
Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>
Date: Thu Nov 13 11:31:10 2025 +0000
Paper Index: Change `num_completions` to `num_generations` (#4515)

commit b7918c0
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Wed Nov 12 20:35:44 2025 -0800
Move GKDTrainer to experimental module (#4474)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 07b5011
Author: Tamoghno Kandar <55907205+tamoghnokandar@users.noreply.github.com>
Date: Wed Nov 12 20:07:33 2025 -0800
Replace flash attention2 with kernels-community/flash-attn2 (#4426)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit 7a57fd4
Author: Yuxian Gu <guyx21@mails.tsinghua.edu.cn>
Date: Thu Nov 13 11:16:20 2025 +0800
MiniLLM: Fix arguments in config & add to documentation index (#4518)

commit a145eaf
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Wed Nov 12 16:35:46 2025 -0800
refactor: Move CPOTrainer to experimental module (#4470)

commit d2dc717
Author: Taha Yassine <40228615+taha-yassine@users.noreply.github.com>
Date: Thu Nov 13 00:56:47 2025 +0100
Replace `wandb_log_unique_prompts` with `log_unique_prompts` (#4508)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 799b39b
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Nov 12 16:21:05 2025 -0700
`device_map` and `dtype` to `"auto"` by default (#4509)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit a6a2beb
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Wed Nov 12 09:42:31 2025 -0700
Add temporary workaround for `lr_scheduler_kwargs` dtype issue in Transformers 4.57.0 (#4513)

commit 346701a
Author: lewtun <lewis.c.tunstall@gmail.com>
Date: Wed Nov 12 17:42:18 2025 +0100
Replace accelerate logging with stdlib in CLI (#4512)

commit 4db63af
Author: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Date: Wed Nov 12 02:19:51 2025 +0000
Fix GRPO unsqueeze advantages

commit ecb2811
Author: Yuxian Gu <guyx21@mails.tsinghua.edu.cn>
Date: Wed Nov 12 10:17:22 2025 +0800
Add MiniLLM Trainer (#4504)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 89e4688
Author: Taha Yassine <40228615+taha-yassine@users.noreply.github.com>
Date: Tue Nov 11 20:36:23 2025 +0100
Add support for images inside tables with Trackio completions logging (#4505)

commit 2d3279c
Author: lewtun <lewis.c.tunstall@gmail.com>
Date: Tue Nov 11 19:22:25 2025 +0100
Tweak description for vLLM sleep mode (#4506)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 02a3477
Author: Luke Hinds <lukehinds@gmail.com>
Date: Mon Nov 10 16:41:51 2025 +0000
Fix link to OpenEnv docs (#4502)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

commit aaed6c1
Author: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Date: Sat Nov 8 08:20:48 2025 -0700
Consistency regarding relative imports (#4498)

commit 20760ba
Author: burtenshaw <ben.burtenshaw@gmail.com>
Date: Fri Nov 7 10:50:50 2025 +0100
[DOCS] update and fix openenv (#4490)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 64cfca4
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Nov 6 22:47:04 2025 -0800
Move judges to experimental submodule (#4439)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 97ca1a2
Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>
Date: Fri Nov 7 00:20:15 2025 +0000
Fix bugs in CISPO conditions (#4499)

commit ffb3dd5
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Nov 6 16:03:00 2025 -0800
docs: Add PEFT subsection to reducing memory usage guide (#4430)
Co-authored-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>

commit 43b6541
Author: SolarWindRider <31797478+SolarWindRider@users.noreply.github.com>
Date: Fri Nov 7 06:55:34 2025 +0800
Support completion bootstrap for VLM in GRPO/RLOO (#4452)
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 642b721
Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>
Date: Thu Nov 6 22:33:00 2025 +0000
ScaleRL: Add CISPO Loss (#4495)

commit 32e9c9f
Author: Ishita Bhattacharyya <139248026+ishitab02@users.noreply.github.com>
Date: Fri Nov 7 03:37:43 2025 +0530
⛴️ Add kernels to Docker images (#4445)
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 1bcfc50
Author: Behrooz Azarkhalili <80390531+behroozazarkhalili@users.noreply.github.com>
Date: Thu Nov 6 13:40:12 2025 -0800
Move XPOTrainer to trl.experimental.xpo (#4485)
Co-authored-by: Invidia19 <54266187+Invidia19@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>

commit 37942bc
Author: Pramodith Ballapuram <16939722+pramodith@users.noreply.github.com>
Date: Thu Nov 6 21:32:03 2025 +0000
Buffer samples based on group level stds. (#4492)

commit 66cd02a
Author: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Date: Thu Nov 6 20:58:25 2025 +0100
Add tiny model Qwen3VLForConditionalGeneration to CI (#4494)

commit 32febb4
Author: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Thu Nov 6 18:21:56 2025 +0100
Add LFM2 to SFT notebook examples (#4455)
1 parent c2db596 commit b6815e3

126 files changed, with 8028 additions and 5771 deletions

.github/workflows/docker-build.yml

Lines changed: 4 additions & 2 deletions
@@ -13,7 +13,8 @@ concurrency:
 jobs:
   trl:
     name: "Build and push TRL Docker image"
-    runs-on: ubuntu-latest
+    runs-on:
+      group: aws-general-8-plus
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -52,7 +53,8 @@ jobs:

   trl-dev:
     name: "Build and push TRL Dev Docker image"
-    runs-on: ubuntu-latest
+    runs-on:
+      group: aws-general-8-plus
     steps:
       - name: Checkout code
         uses: actions/checkout@v4

README.md

Lines changed: 5 additions & 8 deletions
@@ -21,11 +21,11 @@

 **OpenEnv Integration:** TRL now supports **[OpenEnv](https://huggingface.co/blog/openenv)**, the open-source framework from Meta for defining, deploying, and interacting with environments in reinforcement learning and agentic workflows.

-Explore how to seamlessly integrate TRL with OpenEnv in our [dedicated documentation](openenv).
+Explore how to seamlessly integrate TRL with OpenEnv in our [dedicated documentation](https://huggingface.co/docs/trl/openenv).

 ## Overview

-TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Built on top of the [🤗 Transformers](https://github.com/huggingface/transformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.
+TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Built on top of the [🤗 Transformers](https://github.com/huggingface/transformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled-up across various hardware setups.

 ## Highlights

@@ -92,16 +92,13 @@ trainer.train()
 ```python
 from datasets import load_dataset
 from trl import GRPOTrainer
+from trl.rewards import accuracy_reward

-dataset = load_dataset("trl-lib/tldr", split="train")
-
-# Dummy reward function: count the number of unique characters in the completions
-def reward_num_unique_chars(completions, **kwargs):
-    return [len(set(c)) for c in completions]
+dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

 trainer = GRPOTrainer(
     model="Qwen/Qwen2-0.5B-Instruct",
-    reward_funcs=reward_num_unique_chars,
+    reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
 trainer.train()

docker/trl-dev/Dockerfile

Lines changed: 2 additions & 3 deletions
@@ -1,6 +1,5 @@
-FROM pytorch/pytorch:2.8.0-cuda12.8-cudnn9-runtime
+FROM pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
 RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
 RUN pip install --upgrade pip uv
 RUN uv pip install --system --no-cache "git+https://github.com/huggingface/trl.git#egg=trl[liger,peft,vlm]"
-RUN uv pip install --system hf_transfer liger_kernel trackio peft
-RUN uv pip install --system https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
+RUN uv pip install --system kernels liger_kernel peft trackio

docker/trl/Dockerfile

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-FROM pytorch/pytorch:2.8.0-cuda12.8-cudnn9-runtime
+FROM pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
+RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
 RUN pip install --upgrade pip uv
-RUN uv pip install --system trl[liger,peft,vlm] hf_transfer trackio
-RUN uv pip install --system https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
+RUN uv pip install --system trl[liger,peft,vlm] kernels trackio

docs/source/_toctree.yml

Lines changed: 18 additions & 12 deletions
@@ -56,22 +56,14 @@
   title: Examples
 - sections:
   - sections: # Sorted alphabetically
-    - local: cpo_trainer
-      title: CPO
     - local: dpo_trainer
       title: DPO
     - local: online_dpo_trainer
       title: Online DPO
-    - local: gkd_trainer
-      title: GKD
     - local: grpo_trainer
       title: GRPO
     - local: kto_trainer
       title: KTO
-    - local: nash_md_trainer
-      title: Nash-MD
-    - local: ppo_trainer
-      title: PPO
     - local: prm_trainer
       title: PRM
     - local: reward_trainer
@@ -80,15 +72,11 @@
       title: RLOO
     - local: sft_trainer
       title: SFT
-    - local: xpo_trainer
-      title: XPO
     title: Trainers
   - local: models
     title: Model Classes
   - local: model_utils
     title: Model Utilities
-  - local: judges
-    title: Judges
   - local: callbacks
     title: Callbacks
   - local: data_utils
@@ -107,14 +95,32 @@
     title: BEMA for Reference Model
   - local: bco_trainer
     title: BCO
+  - local: cpo_trainer
+    title: CPO
   - local: gfpo
     title: GFPO
+  - local: gkd_trainer
+    title: GKD
   - local: gold_trainer
     title: GOLD
   - local: grpo_with_replay_buffer
     title: GRPO With Replay Buffer
   - local: gspo_token
     title: GSPO-token
+  - local: judges
+    title: Judges
+  - local: minillm
+    title: MiniLLM
+  - local: nash_md_trainer
+    title: Nash-MD
+  - local: orpo_trainer
+    title: ORPO
+  - local: papo_trainer
+    title: PAPO
+  - local: ppo_trainer
+    title: PPO
+  - local: xpo_trainer
+    title: XPO
   - local: openenv
     title: OpenEnv Integration
   - local: orpo_trainer
