From 0a756fd39e30b82698017cf12e010d21a621eac2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Quentin=20Gallou=C3=A9dec?=
Date: Fri, 4 Oct 2024 16:35:39 +0000
Subject: [PATCH] add trl to tag for models

---
 docs/source/alignprop_trainer.mdx     | 2 +-
 docs/source/bco_trainer.mdx           | 2 +-
 docs/source/cpo_trainer.mdx           | 2 +-
 docs/source/ddpo_trainer.mdx          | 2 +-
 docs/source/dpo_trainer.mdx           | 2 +-
 docs/source/gkd_trainer.md            | 2 +-
 docs/source/iterative_sft_trainer.mdx | 2 +-
 docs/source/kto_trainer.mdx           | 2 +-
 docs/source/nash_md_trainer.md        | 2 +-
 docs/source/online_dpo_trainer.md     | 2 +-
 docs/source/orpo_trainer.md           | 2 +-
 docs/source/ppo_trainer.mdx           | 2 +-
 docs/source/ppov2_trainer.md          | 2 +-
 docs/source/reward_trainer.mdx        | 2 +-
 docs/source/rloo_trainer.md           | 2 +-
 docs/source/sft_trainer.mdx           | 2 +-
 docs/source/xpo_trainer.mdx           | 2 +-
 17 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/docs/source/alignprop_trainer.mdx b/docs/source/alignprop_trainer.mdx
index 157f44207b..d76b5665da 100644
--- a/docs/source/alignprop_trainer.mdx
+++ b/docs/source/alignprop_trainer.mdx
@@ -1,6 +1,6 @@
 # Aligning Text-to-Image Diffusion Models with Reward Backpropagation
 
-[![](https://img.shields.io/badge/All_models-AlignProp-blue)](https://huggingface.co/models?other=alignprop)
+[![](https://img.shields.io/badge/All_models-AlignProp-blue)](https://huggingface.co/models?other=alignprop,trl)
 
 ## The why
 
diff --git a/docs/source/bco_trainer.mdx b/docs/source/bco_trainer.mdx
index c200d91cbc..adae3c3fa6 100644
--- a/docs/source/bco_trainer.mdx
+++ b/docs/source/bco_trainer.mdx
@@ -1,6 +1,6 @@
 # BCO Trainer
 
-[![](https://img.shields.io/badge/All_models-BCO-blue)](https://huggingface.co/models?other=bco)
+[![](https://img.shields.io/badge/All_models-BCO-blue)](https://huggingface.co/models?other=bco,trl)
 
 TRL supports the Binary Classifier Optimization (BCO). The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
 
diff --git a/docs/source/cpo_trainer.mdx b/docs/source/cpo_trainer.mdx
index 6ebb3b84ca..5546258834 100644
--- a/docs/source/cpo_trainer.mdx
+++ b/docs/source/cpo_trainer.mdx
@@ -1,6 +1,6 @@
 # CPO Trainer
 
-[![](https://img.shields.io/badge/All_models-CPO-blue)](https://huggingface.co/models?other=cpo)
+[![](https://img.shields.io/badge/All_models-CPO-blue)](https://huggingface.co/models?other=cpo,trl)
 
 ## Overview
 
diff --git a/docs/source/ddpo_trainer.mdx b/docs/source/ddpo_trainer.mdx
index 0b50132bb8..20dbbe82b1 100644
--- a/docs/source/ddpo_trainer.mdx
+++ b/docs/source/ddpo_trainer.mdx
@@ -1,6 +1,6 @@
 # Denoising Diffusion Policy Optimization
 
-[![](https://img.shields.io/badge/All_models-DDPO-blue)](https://huggingface.co/models?other=ddpo)
+[![](https://img.shields.io/badge/All_models-DDPO-blue)](https://huggingface.co/models?other=ddpo,trl)
 
 ## The why
 
diff --git a/docs/source/dpo_trainer.mdx b/docs/source/dpo_trainer.mdx
index 01f3b3e097..30b1725f95 100644
--- a/docs/source/dpo_trainer.mdx
+++ b/docs/source/dpo_trainer.mdx
@@ -1,6 +1,6 @@
 # DPO Trainer
 
-[![](https://img.shields.io/badge/All_models-DPO-blue)](https://huggingface.co/models?other=dpo)
+[![](https://img.shields.io/badge/All_models-DPO-blue)](https://huggingface.co/models?other=dpo,trl)
 
 ## Overview
 
diff --git a/docs/source/gkd_trainer.md b/docs/source/gkd_trainer.md
index 7e5ab9e6e6..14acc7150c 100644
--- a/docs/source/gkd_trainer.md
+++ b/docs/source/gkd_trainer.md
@@ -1,6 +1,6 @@
 # Generalized Knowledge Distillation Trainer
 
-[![](https://img.shields.io/badge/All_models-GKD-blue)](https://huggingface.co/models?other=gkd)
+[![](https://img.shields.io/badge/All_models-GKD-blue)](https://huggingface.co/models?other=gkd,trl)
 
 ## Overview
 
diff --git a/docs/source/iterative_sft_trainer.mdx b/docs/source/iterative_sft_trainer.mdx
index caf8c5076d..7a4fabbf63 100644
--- a/docs/source/iterative_sft_trainer.mdx
+++ b/docs/source/iterative_sft_trainer.mdx
@@ -1,6 +1,6 @@
 # Iterative Trainer
 
-[![](https://img.shields.io/badge/All_models-Iterative_SFT-blue)](https://huggingface.co/models?other=iterative-sft)
+[![](https://img.shields.io/badge/All_models-Iterative_SFT-blue)](https://huggingface.co/models?other=iterative-sft,trl)
 
 Iterative fine-tuning is a training method that enables to perform custom actions (generation and filtering for example) between optimization steps. In TRL we provide an easy-to-use API to fine-tune your models in an iterative way in just a few lines of code.
 
diff --git a/docs/source/kto_trainer.mdx b/docs/source/kto_trainer.mdx
index 91c6ea69b1..9a007e63cb 100644
--- a/docs/source/kto_trainer.mdx
+++ b/docs/source/kto_trainer.mdx
@@ -1,6 +1,6 @@
 # KTO Trainer
 
-[![](https://img.shields.io/badge/All_models-KTO-blue)](https://huggingface.co/models?other=kto)
+[![](https://img.shields.io/badge/All_models-KTO-blue)](https://huggingface.co/models?other=kto,trl)
 
 TRL supports the Kahneman-Tversky Optimization (KTO) Trainer for aligning language models with binary feedback data (e.g., upvote/downvote), as described in the [paper](https://huggingface.co/papers/2402.01306) by Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. For a full example have a look at [`examples/scripts/kto.py`].
 
diff --git a/docs/source/nash_md_trainer.md b/docs/source/nash_md_trainer.md
index e7d6fb679b..f9d878d84f 100644
--- a/docs/source/nash_md_trainer.md
+++ b/docs/source/nash_md_trainer.md
@@ -1,6 +1,6 @@
 # Nash-MD Trainer
 
-[![](https://img.shields.io/badge/All_models-Nash--MD-blue)](https://huggingface.co/models?other=nash-md)
+[![](https://img.shields.io/badge/All_models-Nash--MD-blue)](https://huggingface.co/models?other=nash-md,trl)
 
 ## Overview
 
diff --git a/docs/source/online_dpo_trainer.md b/docs/source/online_dpo_trainer.md
index bbb2fe1b1c..953c999d56 100644
--- a/docs/source/online_dpo_trainer.md
+++ b/docs/source/online_dpo_trainer.md
@@ -1,6 +1,6 @@
 # Online DPO Trainer
 
-[![](https://img.shields.io/badge/All_models-Online_DPO-blue)](https://huggingface.co/models?other=online-dpo)
+[![](https://img.shields.io/badge/All_models-Online_DPO-blue)](https://huggingface.co/models?other=online-dpo,trl)
 
 ## Overview
 
diff --git a/docs/source/orpo_trainer.md b/docs/source/orpo_trainer.md
index c717eff454..578792eb29 100644
--- a/docs/source/orpo_trainer.md
+++ b/docs/source/orpo_trainer.md
@@ -1,6 +1,6 @@
 # ORPO Trainer
 
-[![](https://img.shields.io/badge/All_models-ORPO-blue)](https://huggingface.co/models?other=orpo)
+[![](https://img.shields.io/badge/All_models-ORPO-blue)](https://huggingface.co/models?other=orpo,trl)
 
 [Odds Ratio Preference Optimization](https://huggingface.co/papers/2403.07691) (ORPO) by Jiwoo Hong, Noah Lee, and James Thorne studies the crucial role of SFT within the context of preference alignment. Using preference data the method posits that a minor penalty for the disfavored generation together with a strong adaption signal to the chosen response via a simple log odds ratio term appended to the NLL loss is sufficient for preference-aligned SFT.
 
diff --git a/docs/source/ppo_trainer.mdx b/docs/source/ppo_trainer.mdx
index c1eb54d912..ebc97a9e28 100644
--- a/docs/source/ppo_trainer.mdx
+++ b/docs/source/ppo_trainer.mdx
@@ -1,6 +1,6 @@
 # PPO Trainer
 
-[![](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo)
+[![](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo,trl)
 
 TRL supports the [PPO](https://huggingface.co/papers/1707.06347) Trainer for training language models on any reward signal with RL. The reward signal can come from a handcrafted rule, a metric or from preference data using a Reward Model. For a full example have a look at [`examples/notebooks/gpt2-sentiment.ipynb`](https://github.com/lvwerra/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb). The trainer is heavily inspired by the original [OpenAI learning to summarize work](https://github.com/openai/summarize-from-feedback).
 
diff --git a/docs/source/ppov2_trainer.md b/docs/source/ppov2_trainer.md
index bb37079b17..93adf0ffdc 100644
--- a/docs/source/ppov2_trainer.md
+++ b/docs/source/ppov2_trainer.md
@@ -1,6 +1,6 @@
 # PPOv2 Trainer
 
-[![](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo)
+[![](https://img.shields.io/badge/All_models-PPO-blue)](https://huggingface.co/models?other=ppo,trl)
 
 TRL supports training LLMs with [Proximal Policy Optimization (PPO)](https://huggingface.co/papers/1707.06347).
 
diff --git a/docs/source/reward_trainer.mdx b/docs/source/reward_trainer.mdx
index 3ac3de3261..df3185007d 100644
--- a/docs/source/reward_trainer.mdx
+++ b/docs/source/reward_trainer.mdx
@@ -1,6 +1,6 @@
 # Reward Modeling
 
-[![](https://img.shields.io/badge/All_models-Reward_Trainer-blue)](https://huggingface.co/models?other=reward-trainer)
+[![](https://img.shields.io/badge/All_models-Reward_Trainer-blue)](https://huggingface.co/models?other=reward-trainer,trl)
 
 TRL supports custom reward modeling for anyone to perform reward modeling on their dataset and model.
 
diff --git a/docs/source/rloo_trainer.md b/docs/source/rloo_trainer.md
index 4f31c9b7fc..cf1546a414 100644
--- a/docs/source/rloo_trainer.md
+++ b/docs/source/rloo_trainer.md
@@ -1,6 +1,6 @@
 # RLOO Trainer
 
-[![](https://img.shields.io/badge/All_models-RLOO-blue)](https://huggingface.co/models?other=rloo)
+[![](https://img.shields.io/badge/All_models-RLOO-blue)](https://huggingface.co/models?other=rloo,trl)
 
 TRL supports training LLMs with REINFORCE Leave-One-Out (RLOO). The idea is that instead of using a value function, RLOO generates K completions for each prompt. For each completion, RLOO uses the mean scores from the other K-1 completions as a baseline to calculate the advantage. RLOO also models the entire completion as a single action, where as PPO models each token as an action. Note that REINFORCE / A2C is a special case of PPO, when the number of PPO epochs is 1 and the number of mini-batches is 1, which is how we implement RLOO in TRL.
 
diff --git a/docs/source/sft_trainer.mdx b/docs/source/sft_trainer.mdx
index 39e9e9b638..1245c56450 100644
--- a/docs/source/sft_trainer.mdx
+++ b/docs/source/sft_trainer.mdx
@@ -1,6 +1,6 @@
 # Supervised Fine-tuning Trainer
 
-[![](https://img.shields.io/badge/All_models-SFT-blue)](https://huggingface.co/models?other=sft)
+[![](https://img.shields.io/badge/All_models-SFT-blue)](https://huggingface.co/models?other=sft,trl)
 
 Supervised fine-tuning (or SFT for short) is a crucial step in RLHF. In TRL we provide an easy-to-use API to create your SFT models and train them with few lines of code on your dataset.
 
diff --git a/docs/source/xpo_trainer.mdx b/docs/source/xpo_trainer.mdx
index 3d355b4dca..16a0767efa 100644
--- a/docs/source/xpo_trainer.mdx
+++ b/docs/source/xpo_trainer.mdx
@@ -1,6 +1,6 @@
 # XPO Trainer
 
-[![](https://img.shields.io/badge/All_models-XPO-blue)](https://huggingface.co/models?other=xpo)
+[![](https://img.shields.io/badge/All_models-XPO-blue)](https://huggingface.co/models?other=xpo,trl)
 
 ## Overview
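
Note (not part of the patch): each updated badge link now queries the Hub for models carrying both the trainer-specific tag and the `trl` tag (e.g. `https://huggingface.co/models?other=dpo,trl`). As a minimal sketch of the equivalent programmatic query, assuming the `huggingface_hub` client is installed:

```python
# Sketch only: list models tagged with both "dpo" and "trl",
# mirroring the badge link https://huggingface.co/models?other=dpo,trl
from huggingface_hub import HfApi

api = HfApi()

# Passing several tags to `filter` restricts results to models that have all of them.
for model in api.list_models(filter=["dpo", "trl"], limit=5):
    print(model.id)
```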