🏷️ Model badges: select only TRL models #2178

Merged
merged 1 commit into from Oct 7, 2024
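For context, each badge links to a Hub search that filters models by tag; the `other=` query takes a comma-separated list, so appending `,trl` means a model must carry both the method tag and the `trl` tag to be listed. Below is a minimal sketch of the equivalent programmatic query, assuming the `huggingface_hub` client and AND semantics for multiple tags (the `dpo` tag is just an illustrative pick, not specific to this PR):

```python
# Sketch: list models carrying BOTH the method tag and the `trl` tag,
# mirroring what a badge URL like https://huggingface.co/models?other=dpo,trl shows.
from huggingface_hub import list_models

# Passing several tags to `filter` requires all of them to be present on a model.
for model in list_models(filter=["dpo", "trl"], limit=5):
    print(model.id, model.tags)
```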
2 changes: 1 addition & 1 deletion docs/source/alignprop_trainer.mdx
@@ -1,6 +1,6 @@
 # Aligning Text-to-Image Diffusion Models with Reward Backpropagation

-[![](https://img.shields.io/badge/All_models-AlignProp-blue)](https://huggingface.co/models?other=alignprop)
+[![](https://img.shields.io/badge/All_models-AlignProp-blue)](https://huggingface.co/models?other=alignprop,trl)

 ## The why
2 changes: 1 addition & 1 deletion docs/source/bco_trainer.mdx
@@ -1,6 +1,6 @@
 # BCO Trainer

-[![](https://img.shields.io/badge/All_models-BCO-blue)](https://huggingface.co/models?other=bco)
+[![](https://img.shields.io/badge/All_models-BCO-blue)](https://huggingface.co/models?other=bco,trl)

 TRL supports the Binary Classifier Optimization (BCO).
 The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
2 changes: 1 addition & 1 deletion docs/source/cpo_trainer.mdx
@@ -1,6 +1,6 @@
 # CPO Trainer

-[![](https://img.shields.io/badge/All_models-CPO-blue)](https://huggingface.co/models?other=cpo)
+[![](https://img.shields.io/badge/All_models-CPO-blue)](https://huggingface.co/models?other=cpo,trl)

 ## Overview
2 changes: 1 addition & 1 deletion docs/source/ddpo_trainer.mdx
@@ -1,6 +1,6 @@
 # Denoising Diffusion Policy Optimization

-[![](https://img.shields.io/badge/All_models-DDPO-blue)](https://huggingface.co/models?other=ddpo)
+[![](https://img.shields.io/badge/All_models-DDPO-blue)](https://huggingface.co/models?other=ddpo,trl)

 ## The why
2 changes: 1 addition & 1 deletion docs/source/dpo_trainer.mdx
@@ -1,6 +1,6 @@
 # DPO Trainer

-[![](https://img.shields.io/badge/All_models-DPO-blue)](https://huggingface.co/models?other=dpo)
+[![](https://img.shields.io/badge/All_models-DPO-blue)](https://huggingface.co/models?other=dpo,trl)

 ## Overview
2 changes: 1 addition & 1 deletion docs/source/gkd_trainer.md
@@ -1,6 +1,6 @@
 # Generalized Knowledge Distillation Trainer

-[![](https://img.shields.io/badge/All_models-GKD-blue)](https://huggingface.co/models?other=gkd)
+[![](https://img.shields.io/badge/All_models-GKD-blue)](https://huggingface.co/models?other=gkd,trl)

 ## Overview
2 changes: 1 addition & 1 deletion docs/source/iterative_sft_trainer.mdx
@@ -1,6 +1,6 @@
 # Iterative Trainer

-[![](https://img.shields.io/badge/All_models-Iterative_SFT-blue)](https://huggingface.co/models?other=iterative-sft)
+[![](https://img.shields.io/badge/All_models-Iterative_SFT-blue)](https://huggingface.co/models?other=iterative-sft,trl)


 Iterative fine-tuning is a training method that enables to perform custom actions (generation and filtering for example) between optimization steps. In TRL we provide an easy-to-use API to fine-tune your models in an iterative way in just a few lines of code.
2 changes: 1 addition & 1 deletion docs/source/kto_trainer.mdx
@@ -1,6 +1,6 @@
 # KTO Trainer

-[![](https://img.shields.io/badge/All_models-KTO-blue)](https://huggingface.co/models?other=kto)
+[![](https://img.shields.io/badge/All_models-KTO-blue)](https://huggingface.co/models?other=kto,trl)

 TRL supports the Kahneman-Tversky Optimization (KTO) Trainer for aligning language models with binary feedback data (e.g., upvote/downvote), as described in the [paper](https://huggingface.co/papers/2402.01306) by Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela.
 For a full example have a look at [`examples/scripts/kto.py`].
2 changes: 1 addition & 1 deletion docs/source/nash_md_trainer.md
@@ -1,6 +1,6 @@
 # Nash-MD Trainer

-[![](https://img.shields.io/badge/All_models-Nash--MD-blue)](https://huggingface.co/models?other=nash-md)
+[![](https://img.shields.io/badge/All_models-Nash--MD-blue)](https://huggingface.co/models?other=nash-md,trl)

 ## Overview
2 changes: 1 addition & 1 deletion docs/source/online_dpo_trainer.md
@@ -1,6 +1,6 @@
 # Online DPO Trainer

-[![](https://img.shields.io/badge/All_models-Online_DPO-blue)](https://huggingface.co/models?other=online-dpo)
+[![](https://img.shields.io/badge/All_models-Online_DPO-blue)](https://huggingface.co/models?other=online-dpo,trl)

 ## Overview
2 changes: 1 addition & 1 deletion docs/source/orpo_trainer.md
@@ -1,6 +1,6 @@
 # ORPO Trainer

-[![](https://img.shields.io/badge/All_models-ORPO-blue)](https://huggingface.co/models?other=orpo)
+[![](https://img.shields.io/badge/All_models-ORPO-blue)](https://huggingface.co/models?other=orpo,trl)

 [Odds Ratio Preference Optimization](https://huggingface.co/papers/2403.07691) (ORPO) by Jiwoo Hong, Noah Lee, and James Thorne studies the crucial role of SFT within the context of preference alignment. Using preference data the method posits that a minor penalty for the disfavored generation together with a strong adaption signal to the chosen response via a simple log odds ratio term appended to the NLL loss is sufficient for preference-aligned SFT.
2 changes: 1 addition & 1 deletion docs/source/ppo_trainer.mdx
@@ -1,6 +1,6 @@
 # PPO Trainer

-[![](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo)
+[![](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo,trl)

 TRL supports the [PPO](https://huggingface.co/papers/1707.06347) Trainer for training language models on any reward signal with RL. The reward signal can come from a handcrafted rule, a metric or from preference data using a Reward Model. For a full example have a look at [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb). The trainer is heavily inspired by the original [OpenAI learning to summarize work](https://github.com/openai/summarize-from-feedback).
2 changes: 1 addition & 1 deletion docs/source/ppov2_trainer.md
@@ -1,6 +1,6 @@
 # PPOv2 Trainer

-[![](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo)
+[![](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo,trl)

 TRL supports training LLMs with [Proximal Policy Optimization (PPO)](https://huggingface.co/papers/1707.06347).
2 changes: 1 addition & 1 deletion docs/source/reward_trainer.mdx
@@ -1,6 +1,6 @@
 # Reward Modeling

-[![](https://img.shields.io/badge/All_models-Reward_Trainer-blue)](https://huggingface.co/models?other=reward-trainer)
+[![](https://img.shields.io/badge/All_models-Reward_Trainer-blue)](https://huggingface.co/models?other=reward-trainer,trl)

 TRL supports custom reward modeling for anyone to perform reward modeling on their dataset and model.
2 changes: 1 addition & 1 deletion docs/source/rloo_trainer.md
@@ -1,6 +1,6 @@
 # RLOO Trainer

-[![](https://img.shields.io/badge/All_models-RLOO-blue)](https://huggingface.co/models?other=rloo)
+[![](https://img.shields.io/badge/All_models-RLOO-blue)](https://huggingface.co/models?other=rloo,trl)

 TRL supports training LLMs with REINFORCE Leave-One-Out (RLOO). The idea is that instead of using a value function, RLOO generates K completions for each prompt. For each completion, RLOO uses the mean scores from the other K-1 completions as a baseline to calculate the advantage. RLOO also models the entire completion as a single action, where as PPO models each token as an action. Note that REINFORCE / A2C is a special case of PPO, when the number of PPO epochs is 1 and the number of mini-batches is 1, which is how we implement RLOO in TRL.
2 changes: 1 addition & 1 deletion docs/source/sft_trainer.mdx
@@ -1,6 +1,6 @@
 # Supervised Fine-tuning Trainer

-[![](https://img.shields.io/badge/All_models-SFT-blue)](https://huggingface.co/models?other=sft)
+[![](https://img.shields.io/badge/All_models-SFT-blue)](https://huggingface.co/models?other=sft,trl)

 Supervised fine-tuning (or SFT for short) is a crucial step in RLHF. In TRL we provide an easy-to-use API to create your SFT models and train them with few lines of code on your dataset.
2 changes: 1 addition & 1 deletion docs/source/xpo_trainer.mdx
@@ -1,6 +1,6 @@
 # XPO Trainer

-[![](https://img.shields.io/badge/All_models-XPO-blue)](https://huggingface.co/models?other=xpo)
+[![](https://img.shields.io/badge/All_models-XPO-blue)](https://huggingface.co/models?other=xpo,trl)

 ## Overview