
Conversation

@shirinyamani (Contributor) commented Jul 29, 2025

What does this PR do?

This PR adds a new RLOO trainer, an updated version of the RLOO implementation shipped in TRL < 0.25.0. Here is a simple script to run it:

# train_rloo.py
from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Define the reward function, which rewards completions that are close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = RLOOConfig(output_dir="Qwen2-0.5B-RLOO")
trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
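For context, RLOO (REINFORCE Leave-One-Out) baselines each completion's reward with the mean reward of the other completions sampled for the same prompt. A minimal sketch of that advantage computation (illustrative only, not the trainer's internal code; assumes rewards are grouped by prompt):

import torch

def rloo_advantages(rewards: torch.Tensor, k: int) -> torch.Tensor:
    """rewards: flat tensor of shape (num_prompts * k,), grouped per prompt.
    Returns leave-one-out advantages of the same shape."""
    grouped = rewards.view(-1, k)  # (num_prompts, k)
    # baseline for each sample = mean reward of the other k - 1 completions in its group
    baseline = (grouped.sum(dim=1, keepdim=True) - grouped) / (k - 1)
    return (grouped - baseline).flatten()

# e.g. 2 prompts with k=3 completions each
print(rloo_advantages(torch.tensor([1.0, 2.0, 3.0, 0.0, 0.0, 6.0]), k=3))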

Please refer to the migration guide section of the README for the complete list of renamed/removed parameters.
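For a rough sense of the rename, here is a schematic side-by-side of the constructor arguments, based only on the two snippets in this PR description and the reproduction script below (not runnable on its own; names refer to the objects defined in those snippets, and the README remains the authoritative mapping):

# Old API (TRL < 0.25.0)
trainer = RLOOTrainer(
    config=training_args,        # RLOOConfig
    processing_class=tokenizer,
    policy=policy,               # AutoModelForCausalLM
    ref_policy=ref_policy,       # AutoModelForCausalLM
    reward_model=reward_model,   # AutoModelForSequenceClassification
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# New API
trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # model id or model; no separate ref_policy argument
    reward_funcs=reward_len,           # reward function, as in the example above
    args=training_args,                # RLOOConfig
    train_dataset=dataset,
)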

Benchmark on TLDR:

  • red is old
  • blue is new (training was interrupted)
(benchmark plot attached to the PR)

To reproduce:

Before running, apply the following changes.

Add this to the new trainer so that it logs the same metrics as the old one:

# in _generate_and_score_completions
self._metrics[mode]["objective/scores"].append(mean_grouped_rewards.mean().item())
self._metrics[mode]["policy/clipfrac_avg"].append(gathered_clip_ratio.nanmean().item())
# in compute_loss
self._metrics[mode]["loss/policy_avg"].append(loss.item())

In the old trainer, the reward was not computed properly. Replace this:

_, score, _ = get_reward(
    reward_model, postprocessed_query_response, processing_class.pad_token_id, context_length
)

with:

score = reward_model(postprocessed_query_response).logits[:, 0]
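For reference, with num_labels=1 the sequence-classification head returns logits of shape (batch_size, 1), so logits[:, 0] yields one scalar score per sequence. A minimal standalone sketch of that call, using the reward model from the reproduction script below (illustrative; not the trainer's internal batching):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_model_id = "cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr"
tokenizer = AutoTokenizer.from_pretrained("cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr")
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_id, num_labels=1)

text = "SUBREDDIT: r/test\nPOST: ...\nTL;DR: a short summary"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    scores = reward_model(**inputs).logits[:, 0]  # shape: (batch_size,), one scalar reward per sequence
print(scores)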

Training script

train.py:

import os
import shutil

from accelerate import PartialState
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    HfArgumentParser,
)

from trl import ModelConfig, RLOOConfig, RLOOTrainer, ScriptArguments
from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE


# Enable logging in a Hugging Face Space
os.environ.setdefault("TRACKIO_SPACE_ID", "trl-trackio")

if __name__ == "__main__":
    parser = HfArgumentParser((ScriptArguments, RLOOConfig, ModelConfig))
    script_args, training_args, model_args = parser.parse_args_into_dataclasses()
    # remove output_dir if exists
    shutil.rmtree(training_args.output_dir, ignore_errors=True)

    model_id = "cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr"
    reward_model_id = "cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr"

    ################
    # Model & Tokenizer
    ################
    tokenizer = AutoTokenizer.from_pretrained(
        model_id, padding_side="left", trust_remote_code=model_args.trust_remote_code
    )
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    if tokenizer.chat_template is None:
        tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE
    reward_model = AutoModelForSequenceClassification.from_pretrained(
        reward_model_id, trust_remote_code=model_args.trust_remote_code, num_labels=1
    )
    reward_model.config.pad_token_id = 0
    ref_policy = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=model_args.trust_remote_code)
    policy = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=model_args.trust_remote_code)

    ################
    # Dataset
    ################
    dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)
    train_dataset = dataset[script_args.dataset_train_split]
    eval_dataset = dataset[script_args.dataset_test_split]

    def prepare_dataset(dataset, tokenizer):
        """pre-tokenize the dataset before training; only collate during training"""

        def tokenize(element):
            input_ids = tokenizer.apply_chat_template(
                element["messages"][:1],
                padding=False,
                add_generation_prompt=True,
            )
            return {"input_ids": input_ids, "lengths": len(input_ids)}

        return dataset.map(
            tokenize,
            remove_columns=dataset.column_names,
            num_proc=training_args.dataset_num_proc,
        )

    # Compute that only on the main process for faster data processing.
    # see: https://github.com/huggingface/trl/pull/1255
    with PartialState().local_main_process_first():
        train_dataset = prepare_dataset(train_dataset, tokenizer)
        eval_dataset = prepare_dataset(eval_dataset, tokenizer)
        # filtering
        train_dataset = train_dataset.filter(lambda x: x["lengths"] <= 512, num_proc=training_args.dataset_num_proc)
        eval_dataset = eval_dataset.filter(lambda x: x["lengths"] <= 512, num_proc=training_args.dataset_num_proc)

    assert train_dataset[0]["input_ids"][-1] != tokenizer.eos_token_id, "The last token should not be an EOS token"

    ################
    # Training
    ################
    trainer = RLOOTrainer(
        config=training_args,
        processing_class=tokenizer,
        policy=policy,
        ref_policy=ref_policy,
        reward_model=reward_model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()

    # Save and push to hub
    trainer.save_model(training_args.output_dir)
    if training_args.push_to_hub:
        trainer.push_to_hub(dataset_name=script_args.dataset_name)
Launch command (the last three flags are only available in the new trainer):

python train.py \
    --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
    --dataset_test_split validation \
    --learning_rate 3e-6 \
    --output_dir pythia-1b-deduped-tldr-preference-sft-trl-style-rloo \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --num_ppo_epochs 4 \
    --total_episodes 30000 \
    --stop_token eos \
    --response_length 53 \
    --logging_steps 4 \
    --log_completions \
    --num_completions_to_print 1

Results

See above

Why don't we get an exact match?

One of the reasons is that the learning rate schedule in the old trainer was wrong: it decays two times slower than intended, so the learning rate only goes from 3e-6 down to 1.5e-6 over the course of training.
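As a sanity check of that number, assuming a standard linear decay schedule (an assumption; the exact scheduler type is not stated here), decaying two times slower is equivalent to scheduling over twice the actual number of optimizer steps, which leaves the learning rate at half its initial value when training ends:

# Illustrative arithmetic only: a linear schedule planned over 2x the real number of
# optimizer steps decays half as fast, ending at lr0 / 2 instead of 0.
lr0 = 3e-6
actual_steps = 1000                  # hypothetical number of optimizer steps actually run
scheduled_steps = 2 * actual_steps   # what the buggy schedule is (assumed) configured with

def linear_lr(step, total_steps, lr0):
    return lr0 * max(0.0, 1.0 - step / total_steps)

print(linear_lr(actual_steps, actual_steps, lr0))     # correct schedule ends at 0.0
print(linear_lr(actual_steps, scheduled_steps, lr0))  # buggy schedule ends at 1.5e-06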

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@edbeeching (Collaborator) left a comment:


Thanks for this clean refactor @shirinyamani and @qgallouedec.

Would it be possible to include more details in the PR description of the experiments you have run to validate the results with the new implementation vs the original one?

@qgallouedec qgallouedec changed the base branch from fake_support_branch_for_rloo to main August 28, 2025 16:37
@shirinyamani shirinyamani merged commit e7b37d4 into main Aug 29, 2025
11 checks passed
@shirinyamani shirinyamani deleted the rloo_final branch August 29, 2025 15:27
@huggingface huggingface deleted a comment from qgallouedec Aug 29, 2025
@shirinyamani (Contributor, Author) replied:

> Thanks for this clean refactor @shirinyamani and @qgallouedec.
>
> Would it be possible to include more details in the PR description of the experiments you have run to validate the results with the new implementation vs the original one?

Done! 🔥

