From 53bc0984e396d74f8d49aecfe7ba7056b2051666 Mon Sep 17 00:00:00 2001
From: Behrooz
Date: Sun, 2 Nov 2025 15:33:57 -0800
Subject: [PATCH 1/2] docs: List all trainers that support Liger Kernel

Resolves #4386

- Add "Supported Trainers" section listing SFT, DPO, GRPO, KTO, and GKD
- Replace single SFT example with hfoptions showing all 5 supported trainers
- Remove "under construction" warning as guide is now complete
- Follow same format as reducing_memory_usage.md for consistency
---
 docs/source/liger_kernel_integration.md | 69 +++++++++++++++++++++----
 1 file changed, 59 insertions(+), 10 deletions(-)

diff --git a/docs/source/liger_kernel_integration.md b/docs/source/liger_kernel_integration.md
index 7f9825e8b2a..0a0a95eb0f1 100644
--- a/docs/source/liger_kernel_integration.md
+++ b/docs/source/liger_kernel_integration.md
@@ -1,8 +1,5 @@
 # Liger Kernel Integration
 
-> [!WARNING]
-> Section under construction. Feel free to contribute!
-
 [Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%. That way, we can **4x** our context length, as described in the benchmark below. They have implemented Hugging Face compatible `RMSNorm`, `RoPE`, `SwiGLU`, `CrossEntropy`, `FusedLinearCrossEntropy`, with more to come. The kernel works out of the box with [FlashAttention](https://github.com/Dao-AILab/flash-attention), [PyTorch FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html), and [Microsoft DeepSpeed](https://github.com/microsoft/DeepSpeed).
 
 With this memory reduction, you can potentially turn off `cpu_offloading` or gradient checkpointing to further boost the performance.
@@ -11,19 +8,71 @@ With this memory reduction, you can potentially turn off `cpu_offloading` or gra
 | --- | --- |
 | ![Speed up](https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/e2e-tps.png) | ![Memory](https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/e2e-memory.png) |
 
-1. To use Liger-Kernel in [`SFTTrainer`], first install it by:
-
+## Supported Trainers
+
+Liger Kernel is supported in the following TRL trainers:
+- **SFT** (Supervised Fine-Tuning)
+- **DPO** (Direct Preference Optimization)
+- **GRPO** (Group Relative Policy Optimization)
+- **KTO** (Kahneman-Tversky Optimization)
+- **GKD** (Generalized Knowledge Distillation)
+
+## Usage
+
+1. First, install Liger Kernel:
+
 ```bash
 pip install liger-kernel
 ```
 
-2. Once installed, set `use_liger_kernel` in [`SFTConfig`]. No other changes are needed!
+2. Once installed, set `use_liger_kernel=True` in your trainer config. No other changes are needed!
+
+<hfoptions id="liger">
+<hfoption id="SFT">
+
+```python
+from trl import SFTConfig
+
+training_args = SFTConfig(..., use_liger_kernel=True)
+```
+
+</hfoption>
+<hfoption id="DPO">
+
+```python
+from trl import DPOConfig
+
+training_args = DPOConfig(..., use_liger_kernel=True)
+```
+
+</hfoption>
+<hfoption id="GRPO">
+
+```python
+from trl import GRPOConfig
+
+training_args = GRPOConfig(..., use_liger_kernel=True)
+```
+
+</hfoption>
+<hfoption id="KTO">
 
 ```python
-training_args = SFTConfig(
-    use_liger_kernel=True,
-    ...
-)
+from trl import KTOConfig
+
+training_args = KTOConfig(..., use_liger_kernel=True)
 ```
+
+</hfoption>
+<hfoption id="GKD">
+
+```python
+from trl import GKDConfig
+
+training_args = GKDConfig(..., use_liger_kernel=True)
+```
+
+</hfoption>
+</hfoptions>
 
 To learn more about Liger-Kernel, visit their [official repository](https://github.com/linkedin/Liger-Kernel/).
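+
+As noted above, the memory savings can let you disable gradient checkpointing; a minimal sketch, assuming the kernels free enough activation memory for your model and hardware (the `output_dir` value is just a placeholder):
+
+```python
+from trl import SFTConfig
+
+training_args = SFTConfig(
+    output_dir="liger-sft",  # placeholder output directory
+    use_liger_kernel=True,
+    # With Liger's reduced activation memory, gradient checkpointing can
+    # optionally be turned off to avoid recomputation and recover throughput.
+    gradient_checkpointing=False,
+)
+```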
From fb7df7d02c469d3110334e481c380423860fb5f1 Mon Sep 17 00:00:00 2001
From: Behrooz
Date: Sun, 2 Nov 2025 15:47:34 -0800
Subject: [PATCH 2/2] docs: Move Multi-Adapter RL section to PEFT integration

Resolves #4397

- Moved Multi-Adapter RL content from standalone page to PEFT integration guide
- Removed docs/source/multi_adapter_rl.md file
- Updated _toctree.yml to remove Multi Adapter RLHF reference
- Reorganized content as subsection within PEFT integration
- Kept experimental warnings and technical details intact
---
 docs/source/_toctree.yml        |   2 -
 docs/source/multi_adapter_rl.md | 102 --------------------------------
 docs/source/peft_integration.md |  88 +++++++++++++++++++++++++++
 3 files changed, 88 insertions(+), 104 deletions(-)
 delete mode 100644 docs/source/multi_adapter_rl.md

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 5db0f53fcaa..effa84b9957 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -55,8 +55,6 @@
     title: LoRA Without Regret
   - local: sentiment_tuning
     title: Sentiment Tuning
-  - local: multi_adapter_rl
-    title: Multi Adapter RLHF
   title: Examples
 - sections:
   - sections: # Sorted alphabetically
diff --git a/docs/source/multi_adapter_rl.md b/docs/source/multi_adapter_rl.md
deleted file mode 100644
index 05d51e25b92..00000000000
--- a/docs/source/multi_adapter_rl.md
+++ /dev/null
@@ -1,102 +0,0 @@
-# Multi Adapter RL (MARL) - a single base model for everything
-
-Here we present an approach that uses a single base model for the entire PPO algorithm - which includes retrieving the reference logits, computing the active logits and the rewards. This feature is experimental as we did not test the convergence of the approach. We encourage the community to let us know if they potentially face issues.
-
-## Requirements
-
-You just need to install `peft` and optionally install `bitsandbytes` as well if you want to go for 8bit base models, for more memory efficient finetuning.
-
-## Summary
-
-You need to address this approach in three stages that we summarize as follows:
-
-1- Train a base model on the target domain (e.g. [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb)) - this is the Supervised Fine Tuning stage - it can leverage the `SFTTrainer` from TRL.
-2- Train a reward model using `peft`. This is required in order to re-use the adapter during the RL optimisation process (step 3 below). We show an example of leveraging the `RewardTrainer` from TRL in [this example](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py)
-3- Fine tune new adapters on the base model using PPO and the reward adapter. ("0 abstraction RL")
-
-Make sure to use the same model (i.e. same architecture and same weights) for the stages 2 & 3.
-
-## Quickstart
-
-Let us assume you have trained your reward adapter on `llama-7b` model using `RewardTrainer` and pushed the weights on the hub under `trl-lib/llama-7b-hh-rm-adapter`.
-When doing PPO, before passing the model to `PPOTrainer` create your model as follows:
-
-```python
-model_name = "huggyllama/llama-7b"
-rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"
-
-# PPO adapter
-lora_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    lora_dropout=0.05,
-    bias="none",
-    task_type="CAUSAL_LM",
-)
-
-model = AutoModelForCausalLMWithValueHead.from_pretrained(
-    model_name,
-    peft_config=lora_config,
-    reward_adapter=rm_adapter_id,
-)
-
-...
-trainer = PPOTrainer(
-    model=model,
-    ...
-)
-
-...
-```
-
-Then inside your PPO training loop, call the `compute_reward_score` method by accessing the `model` attribute from `PPOTrainer`.
-
-```python
-rewards = trainer.model.compute_reward_score(**inputs)
-```
-
-## Advanced usage
-
-### Control on the adapter name
-
-If you are familiar with the `peft` library, you know that you can use multiple adapters inside the same model. What you can do is train multiple adapters on the same base model to fine-tune on different policies.
-In this case, you want to be able to control the adapter name you want to activate back, after retrieving the reward. For that, simply pass the appropriate `adapter_name` to `ppo_adapter_name` argument when calling `compute_reward_score`.
-
-```python
-adapter_name_policy_1 = "policy_1"
-rewards = trainer.model.compute_reward_score(**inputs, ppo_adapter_name=adapter_name_policy_1)
-...
-```
-
-### Using 4-bit and 8-bit base models
-
-For more memory efficient fine-tuning, you can load your base model in 8-bit or 4-bit while keeping the adapters in the default precision (float32).
-Just pass the appropriate arguments (i.e. `load_in_8bit=True` or `load_in_4bit=True`) to `AutoModelForCausalLMWithValueHead.from_pretrained` as follows (assuming you have installed `bitsandbytes`):
-
-```python
-model_name = "llama-7b"
-rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"
-
-# PPO adapter
-lora_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    lora_dropout=0.05,
-    bias="none",
-    task_type="CAUSAL_LM",
-)
-
-model = AutoModelForCausalLMWithValueHead.from_pretrained(
-    model_name,
-    peft_config=lora_config,
-    reward_adapter=rm_adapter_id,
-    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
-)
-
-...
-trainer = PPOTrainer(
-    model=model,
-    ...
-)
-...
-```
diff --git a/docs/source/peft_integration.md b/docs/source/peft_integration.md
index 8e8709a2dfa..bd196dd99bf 100644
--- a/docs/source/peft_integration.md
+++ b/docs/source/peft_integration.md
@@ -114,6 +114,94 @@ pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
 
 Finally, make sure that the rewards are computed on correct device as well, for that you can use `ppo_trainer.model.current_device`.
 
+## Multi-Adapter RL Training
+
+You can use a single base model with multiple PEFT adapters for the entire PPO algorithm, including retrieving reference logits, computing active logits, and calculating rewards. This approach is useful for memory-efficient RL training.
+
+> [!WARNING]
+> This feature is experimental and convergence has not been extensively tested. We encourage the community to share feedback and report any issues.
+
+### Requirements
+
+Install PEFT and optionally bitsandbytes for 8-bit base models:
+
+```bash
+pip install peft bitsandbytes
+```
+
+### Training Workflow
+
+The multi-adapter approach requires three stages:
+
+1. **Supervised Fine-Tuning (SFT)**: Train a base model on your target domain (e.g., the [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb)) using `SFTTrainer`
+2. **Reward Model Training**: Train a reward model adapter using PEFT and `RewardTrainer` (see the [reward modeling example](https://github.com/huggingface/trl/tree/main/examples/scripts/reward_modeling.py) and the sketch below)
+3. **PPO Training**: Fine-tune new adapters using PPO with the reward adapter
+
+> [!IMPORTANT]
+> Use the same base model (architecture and weights) for stages 2 and 3.
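+
+For stage 2, a minimal sketch of training the reward adapter could look like the following, assuming the `RewardTrainer` API of recent TRL releases; the preference dataset, base model, and hyperparameters are illustrative:
+
+```python
+from datasets import load_dataset
+from peft import LoraConfig, TaskType
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+from trl import RewardConfig, RewardTrainer
+
+model_name = "huggyllama/llama-7b"  # same base model that stage 3 will reuse
+model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
+model.config.pad_token_id = tokenizer.pad_token_id
+
+# Train the reward head as a PEFT adapter so it can later be attached via `reward_adapter`
+peft_config = LoraConfig(task_type=TaskType.SEQ_CLS, r=16, lora_alpha=32, lora_dropout=0.05)
+
+# Any preference dataset with "chosen"/"rejected" columns works; this one is illustrative
+train_dataset = load_dataset("Anthropic/hh-rlhf", split="train")
+
+trainer = RewardTrainer(
+    model=model,
+    args=RewardConfig(output_dir="llama-7b-rm-adapter"),
+    processing_class=tokenizer,
+    train_dataset=train_dataset,
+    peft_config=peft_config,
+)
+trainer.train()
+```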
+
+### Basic Usage
+
+After training your reward adapter and pushing it to the Hub, create the model as follows before passing it to `PPOTrainer`:
+
+```python
+from peft import LoraConfig
+from trl import AutoModelForCausalLMWithValueHead, PPOTrainer
+
+model_name = "huggyllama/llama-7b"
+rm_adapter_id = "trl-lib/llama-7b-hh-rm-adapter"
+
+# Configure the PPO adapter
+lora_config = LoraConfig(
+    r=16,
+    lora_alpha=32,
+    lora_dropout=0.05,
+    bias="none",
+    task_type="CAUSAL_LM",
+)
+
+# Load the base model with the reward adapter attached
+model = AutoModelForCausalLMWithValueHead.from_pretrained(
+    model_name,
+    peft_config=lora_config,
+    reward_adapter=rm_adapter_id,
+)
+
+trainer = PPOTrainer(model=model, ...)
+```
+
+In your training loop, compute rewards using:
+
+```python
+rewards = trainer.model.compute_reward_score(**inputs)
+```
+
+### Advanced Features
+
+#### Multiple Policy Adapters
+
+You can train multiple adapters on the same base model to fine-tune different policies. To control which policy adapter is reactivated after the reward is computed, pass the `ppo_adapter_name` argument to `compute_reward_score`:
+
+```python
+adapter_name_policy_1 = "policy_1"
+rewards = trainer.model.compute_reward_score(**inputs, ppo_adapter_name=adapter_name_policy_1)
+```
+
+#### Quantized Base Models
+
+For more memory-efficient training, load the base model in 8-bit or 4-bit while keeping the adapters in the default precision (float32):
+
+```python
+from transformers import BitsAndBytesConfig
+
+model = AutoModelForCausalLMWithValueHead.from_pretrained(
+    model_name,
+    peft_config=lora_config,
+    reward_adapter=rm_adapter_id,
+    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
+)
+```
+
 ## Naive pipeline parallelism (NPP) for large models (>60B models)
 
 The `trl` library also supports naive pipeline parallelism (NPP) for large models (>60B models). This is a simple way to parallelize the model across multiple GPUs.
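+
+In practice this amounts to passing a custom `device_map` when loading the model; a minimal sketch, assuming a multi-GPU machine and an illustrative checkpoint name:
+
+```python
+from peft import LoraConfig
+from trl import AutoModelForCausalLMWithValueHead
+
+lora_config = LoraConfig(
+    r=16,
+    lora_alpha=32,
+    lora_dropout=0.05,
+    bias="none",
+    task_type="CAUSAL_LM",
+)
+
+# device_map="auto" lets Accelerate spread the layers over all visible GPUs
+pretrained_model = AutoModelForCausalLMWithValueHead.from_pretrained(
+    "huggyllama/llama-30b",
+    device_map="auto",
+    peft_config=lora_config,
+)
+```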