Releases: huggingface/trl

v0.7.1: Patch release

30 Aug 15:38

Patch release: fix bug with PPOTrainer and log_stats

Fixed a bug in PPOTrainer's log_stats to avoid a breaking change in behaviour.

What's Changed

Full Changelog: v0.7.0...v0.7.1

v0.7.0: Text Environments, Agents & Tools

30 Aug 15:38

Text environments, LLMs with tools and agents!

Text environments provide a learning ground for language agents. They allow a language model to use tools to accomplish a task, such as using a Python interpreter to answer math questions or using a search index for trivia questions. Having access to tools lets language models solve tasks that would be very hard for the model itself but can be trivial with the appropriate tool.

We are excited to bring to the community a complete set of functionalities and full examples to train LLMs to use tools!

Check out the documentation page here and a few examples below:
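
As a rough sketch of what tool use with a text environment looks like (the tool, prompt, reward function, and argument names below are illustrative; the documentation and example scripts are the authoritative reference):

```python
# Sketch: a text environment where the model can call a calculator tool,
# and the result is packaged for a PPO step. Names are illustrative.
import torch
from transformers import AutoTokenizer, load_tool
from trl import AutoModelForCausalLMWithValueHead, TextEnvironment

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def exact_match_reward(responses, answers):
    # Reward 1.0 if the generated response contains the correct answer.
    return [
        torch.tensor(1.0 if answer in response else 0.0)
        for response, answer in zip(responses, answers)
    ]

# Prompt abbreviated here; in practice include few-shot examples of tool calls.
prompt = "Use the calculator tool to answer the question.\n"

env = TextEnvironment(
    model,
    tokenizer,
    tools={"SimpleCalculatorTool": load_tool("ybelkada/simple-calculator")},
    reward_fn=exact_match_reward,
    prompt=prompt,
    max_turns=2,
)

# Running the environment returns everything needed for PPOTrainer.step().
queries, responses, masks, rewards, histories = env.run(
    ["What is 13 + 29?"], answers=["42"]
)
```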

What's Changed

Full Changelog: v0.6.0...v0.7.0

v0.6.0

25 Aug 15:08

DDPO for diffusion models

We are excited to welcome DDPO, the first RLHF algorithm for diffusion models in TRL, used to refine the generations of diffusion models.
Read more about it directly in the docs.
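
For orientation, here is a minimal sketch of what training with the new DDPOTrainer looks like (the prompt and reward functions are toy placeholders and the config values are illustrative; see the docs and the aesthetic-scorer example for the real setup):

```python
# Sketch: fine-tune Stable Diffusion with DDPO using a toy reward.
import torch
from trl import DDPOConfig, DDPOTrainer, DefaultDDPOStableDiffusionPipeline

pipeline = DefaultDDPOStableDiffusionPipeline("runwayml/stable-diffusion-v1-5")

def prompt_fn():
    # Return a prompt and arbitrary metadata for each sample.
    return "a photo of a cute animal", {}

def reward_fn(images, prompts, metadata):
    # Toy reward: mean pixel brightness. Replace with e.g. an aesthetic scorer.
    return torch.stack([img.float().mean() for img in images]), {}

config = DDPOConfig(
    num_epochs=1,
    sample_batch_size=2,
    sample_num_batches_per_epoch=2,
    train_batch_size=1,
)

trainer = DDPOTrainer(config, reward_fn, prompt_fn, pipeline)
trainer.train()
```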

[Image comparison: before vs. after DDPO fine-tuning]

Bug fixes and other enhancements

The release also comes with multiple bug fixes reported and/or contributed by the community; check out the commit history below.

What's Changed

New Contributors

Full Changelog: v0.5.0...v0.6.0

v0.5.0

02 Aug 09:08

v0.5.0: DPOTrainer and multiple bug fixes for PPOTrainer and SFTTrainer

This release includes multiple important bug fixes (SFTTrainer, PPOTrainer). It also extends the current DataCollatorForCompletionOnlyLM to support chat-like training.

DPO Trainer

The DPO algorithm (Direct Preference Optimization) was introduced by Rafailov et al. in this paper; it provides a way to optimize a language model on preference data without having to train a separate reward model. The DPOTrainer is now part of the TRL library for anyone who wants to use it, thanks to the amazing contributors! A usage sketch follows the PR list below.

  • DPO Trainer by @kashif in #416
  • [DPO] make sure all the concated batches are on same device by @kashif in #528
  • [DPO] remove response/pairs from the DPO side by @kashif in #540
  • [DPO] remove unnecessary batch size arg to Collator by @kashif in #554
  • [DPO] Resolve logging for DPOTrainer by @tomaarsen in #570
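
A minimal sketch with a toy one-example preference dataset (model names and hyper-parameters are placeholders; the DPOTrainer documentation has the complete recipe):

```python
# Sketch: preference tuning with DPOTrainer. The dataset needs "prompt",
# "chosen" and "rejected" text columns.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
ref_model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

train_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": ["The capital of France is Paris."],
    "rejected": ["I don't know."],
})

trainer = DPOTrainer(
    model,
    ref_model,  # frozen reference policy
    args=TrainingArguments(
        output_dir="dpo-output",
        per_device_train_batch_size=1,
        remove_unused_columns=False,  # keep raw text columns for DPO's own tokenization
    ),
    beta=0.1,  # strength of the implicit KL constraint to the reference model
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```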

What's Changed

  • Reward trainer multi-gpu eval bug by @rlindskog in #513
  • Use local process index for _get_current_device() by @lewtun in #515

Extending the DataCollatorForCompletionOnlyLM

You can now mask out the user prompts in the DataCollatorForCompletionOnlyLM data collator and train only on the chat completions. Check out the PR below or the corresponding section of the documentation to learn more; a short sketch follows the PR link.

  • Introducing DataCollatorForChatCompletionOnlyLM by @gaetanlop in #456
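
A minimal sketch of the new behaviour (the chat markers below are illustrative; use whichever templates match your data format):

```python
# Sketch: mask user turns so the loss is computed only on assistant completions.
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForCompletionOnlyLM(
    instruction_template="### Human:",   # tokens opening a user turn -> masked out
    response_template="### Assistant:",  # tokens opening an assistant turn -> trained on
    tokenizer=tokenizer,
    mlm=False,
)

text = "### Human: What is 2 + 2?\n### Assistant: 4"
batch = collator([tokenizer(text)])
# Labels are -100 (ignored) everywhere except the assistant response tokens.
print(batch["labels"])
```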

Important bug fixes

Multiple bugs in the supported trainers were reported by the community and fixed in the PRs below.

Big refactor of examples and documentation

The examples and documentation have been refactored; check the PRs below for more details.

New Contributors

Full Changelog: v0.4.7...v0.5.0

v0.4.7

13 Jul 09:08

Patch release: SFTTrainer and PPOTrainer bug fixes

What's Changed

New Contributors

Full Changelog: v0.4.6...v0.4.7

v0.4.6

23 Jun 09:19

Patch release

Patch release to fix a bug on Google Colab with PPOTrainer & PPOConfig + wandb

What's Changed

Full Changelog: v0.4.5...v0.4.6

v0.4.5

23 Jun 08:40

Patch release 1 - SFTTrainer enhancements and fixes

This patch release adds multiple fixes and enhancements for the SFTTrainer. Another patch release is coming to fix an issue with PPOTrainer on Google Colab combined with wandb logging.

What's Changed

New Contributors

Full Changelog: v0.4.4...v0.4.5

v0.4.4

08 Jun 14:42

Patch release

Full Changelog: v0.4.3...v0.4.4

v0.4.3

08 Jun 08:54

0.4.3 Patch release

Patch release - pin accelerate version

Full Changelog: v0.4.2...v0.4.3

v0.4.2

07 Jun 13:20

QLoRA RLHF, SFTTrainer and RewardTrainer

A new version of TRL that includes training larger models using QLoRA (4-bit quantization through bitsandbytes), plus the brand new RewardTrainer and SFTTrainer classes to easily conduct your RLHF projects end-to-end!

Introducing SFTTrainer and RewardTrainer

Use the brand new trainers to easily train your reward model and supervised fine-tuned (SFT) model with a few lines of code!
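
For example, supervised fine-tuning on a text dataset can now be as short as this (the dataset and model below are illustrative; RewardTrainer follows the same pattern on tokenized chosen/rejected pairs):

```python
# Sketch: fine-tune a causal LM on the "text" column of a dataset with SFTTrainer.
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",        # model name or a preloaded model
    train_dataset=dataset,
    dataset_text_field="text",  # which column holds the training text
    max_seq_length=512,
)
trainer.train()
```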

QLoRA integration

Pass 4-bit models directly into PPOTrainer for more memory-efficient training.
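
A rough sketch of the pattern (model name, LoRA hyper-parameters, and batch sizes are illustrative; see the peft/QLoRA examples in the repo for end-to-end scripts):

```python
# Sketch: load the policy in 4-bit with a LoRA adapter on top, then hand it
# to PPOTrainer as usual. Only the adapter weights are trained.
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "facebook/opt-350m"
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)

model = AutoModelForCausalLMWithValueHead.from_pretrained(
    model_name,
    load_in_4bit=True,        # bitsandbytes 4-bit quantization
    peft_config=lora_config,  # attach a LoRA adapter to the quantized base
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(
    config=PPOConfig(model_name=model_name, batch_size=16, mini_batch_size=4),
    model=model,   # the reference model is optional; see the docs for how it is handled with peft models
    tokenizer=tokenizer,
)
```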

Updated StackLlama example

Great work by @mnoukhov, who managed to fix the issues related to StackLlama and the new versions of accelerate, peft and transformers. The completely reproducible examples are below:

  • StackLLaMA: correctly merge peft model by @mnoukhov in #398
  • StackLlama: fixed RL training and added args by @mnoukhov in #400
  • Fixed some type annotations of trl.trainer.PPoTrainer by @JulesGM in #392
  • StackLLaMA: fix supervised finetuning and reward model training by @mnoukhov in #399

Bug fixes and improvements

New Contributors

Full Changelog: v0.4.1...v0.4.2