
Commit c55ef4b

Update How-to guides (#4604)
1 parent c686d7d commit c55ef4b

2 files changed: +33 / -46 lines


docs/source/reducing_memory_usage.md

Lines changed: 13 additions & 2 deletions
@@ -116,10 +116,9 @@ trainer = SFTTrainer(
 
 PEFT can be combined with other memory reduction techniques such as quantization (4-bit or 8-bit) for even greater memory savings. See [PEFT Integration](peft_integration) for quantization examples.
 
-
 ## Liger for reducing peak memory usage
 
-> [Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%.
+[Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%.
 
 For more information, see [Liger Kernel Integration](liger_kernel_integration).
 
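The hunk above keeps the note that PEFT can be combined with 4-bit or 8-bit quantization but shows no code for it. Below is a minimal sketch of that combination, assuming LoRA via `peft.LoraConfig` and 4-bit loading via `transformers.BitsAndBytesConfig` forwarded through `model_init_kwargs`; the model name and LoRA hyperparameters are illustrative, the `...` placeholder follows the style of these docs, and the reference recipe lives in [PEFT Integration](peft_integration).

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# 4-bit NF4 quantization; forwarded to from_pretrained via model_init_kwargs
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

training_args = SFTConfig(..., model_init_kwargs={"quantization_config": quantization_config})

# Train only small LoRA adapters on top of the quantized base model
peft_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    peft_config=peft_config,
)
```
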
@@ -319,3 +318,15 @@ training_args = RLOOConfig(..., vllm_enable_sleep_mode=True)
 </hfoptions>
 
 Offloading the vLLM weights and cache helps keep GPU memory usage low, which can be particularly beneficial when training large models or using limited GPU resources. However, waking the vLLM engine from sleep mode introduces some host–device transfer latency, which may slightly impact training speed.
+
+## Gradient checkpointing
+
+Gradient checkpointing trades compute for memory by not storing all intermediate activations during the forward pass, recomputing them during the backward pass instead.
+
+```python
+from trl import SFTConfig
+
+training_args = SFTConfig(..., gradient_checkpointing=True)
+```
+
+Gradient checkpointing is available and activated by default across all TRL trainers. For more memory optimization techniques, see the [Transformers Performance Guide](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-checkpointing).
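
As a follow-up to the gradient checkpointing block added above, the sketch below assumes the standard `gradient_checkpointing_kwargs` field that `SFTConfig` inherits from `transformers.TrainingArguments`; whether the non-reentrant variant suits your setup is a judgment call, so treat this as an option rather than a documented default.

```python
from trl import SFTConfig

# gradient_checkpointing_kwargs is forwarded to gradient_checkpointing_enable();
# use_reentrant=False selects the non-reentrant checkpointing implementation.
training_args = SFTConfig(
    ...,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```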

docs/source/speeding_up_training.md

Lines changed: 20 additions & 44 deletions
@@ -103,20 +103,9 @@ You can customize the server configuration by passing additional arguments. For
 
 ## Optimized attention implementations
 
-TRL supports various optimized attention implementations that can significantly speed up training while reducing memory usage. You can use either locally installed backends (like Flash Attention 2) or pull pre-optimized kernels directly from the [Kernels Hub](kernels_hub).
+TRL supports various optimized attention implementations that can significantly speed up training while reducing memory usage. You can use either pre-optimized kernels directly from the [Kernels Hub](kernels_hub) or a manually built attention backend.
 
 <hfoptions id="attention examples">
-<hfoption id="Flash Attention 2">
-
-To enable Flash Attention 2, pass `attn_implementation="flash_attention_2"` in the model initialization arguments:
-
-```python
-from trl import SFTConfig
-
-training_args = SFTConfig(..., model_init_kwargs={"attn_implementation": "flash_attention_2"})
-```
-
-</hfoption>
 <hfoption id="Kernels from Hub">
 
 You can use pre-optimized attention kernels from the Hub without manual compilation:
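
The code block that this colon introduces sits in the unchanged gap between hunks (its first line is visible, truncated, in the next hunk header). As a minimal sketch of the pattern, using one of the Hub kernels this file names as alternatives rather than the truncated id:

```python
from trl import SFTConfig

# Illustrative kernel id, taken from the alternatives listed further down in this file;
# any Kernels Hub attention implementation is selected the same way.
training_args = SFTConfig(..., model_init_kwargs={"attn_implementation": "kernels-community/vllm-flash-attn3"})
```
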
@@ -129,40 +118,30 @@ training_args = SFTConfig(..., model_init_kwargs={"attn_implementation": "kernel
 
 Other options include `kernels-community/vllm-flash-attn3` and `kernels-community/paged-attention`.
 
-</hfoption>
-</hfoptions>
+Optimized attention works across all TRL trainers. For more details, see [Kernels Hub Integration](kernels_hub).
 
-Optimized attention works across all TRL trainers. For more details, see [Kernels Hub Integration](kernels_hub) and [Reducing Memory Usage](reducing_memory_usage#padding-free).
+</hfoption>
+<hfoption id="Manual build">
 
-## PEFT for parameter-efficient training
+> [!WARNING]
+> Manually building optimized attention backends is complex and time-consuming. It is not recommended unless absolutely necessary. Consider using Kernels from the Hub instead, as described in the previous section.
 
-[PEFT](https://huggingface.co/docs/peft/index) (Parameter-Efficient Fine-Tuning) methods like LoRA significantly reduce memory usage and training time by only training a small number of adapter parameters instead of the full model.
+If you have manually installed an optimized attention backend like Flash Attention 2, you can specify it in the training arguments:
 
 ```python
-from peft import LoraConfig
-from trl import SFTConfig, SFTTrainer
-
-peft_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    lora_dropout=0.05,
-    target_modules=["q_proj", "v_proj"],
-)
-
-trainer = SFTTrainer(
-    model="Qwen/Qwen2.5-0.5B",
-    peft_config=peft_config,
-    args=training_args,
-)
+from trl import SFTConfig
+
+training_args = SFTConfig(..., model_init_kwargs={"attn_implementation": "flash_attention_2"})
 ```
 
-For more details, see [PEFT Integration](peft_integration).
+</hfoption>
+</hfoptions>
 
 ## Liger Kernel for memory optimization
 
 Liger Kernel is a collection of Triton kernels designed for LLM training that can increase throughput by 20% and reduce memory usage by 60%.
 
-<hfoptions id="liger examples">
+<hfoptions id="liger">
 <hfoption id="SFT">
 
 ```python
@@ -199,21 +178,18 @@ training_args = KTOConfig(..., use_liger_kernel=True)
 ```
 
 </hfoption>
-</hfoptions>
-
-For more information, see [Liger Kernel Integration](liger_kernel_integration).
-
-## Gradient checkpointing for memory savings
-
-Gradient checkpointing trades compute for memory by not storing all intermediate activations during the forward pass, recomputing them during the backward pass instead.
+<hfoption id="GKD">
 
 ```python
-from trl import SFTConfig
+from trl.experimental.gkd import GKDConfig
 
-training_args = SFTConfig(..., gradient_checkpointing=True)
+training_args = GKDConfig(..., use_liger_kernel=True)
 ```
 
-Gradient checkpointing is available across all TRL trainers. For more memory optimization techniques, see the [Transformers Performance Guide](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-checkpointing).
+</hfoption>
+</hfoptions>
+
+For more information, see [Liger Kernel Integration](liger_kernel_integration).
 
 ## Mixed precision training
 
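The hunk ends at the unchanged `## Mixed precision training` heading; that section's body lies outside the diff. As a minimal sketch of what such a section typically configures, assuming the standard `bf16`/`fp16` flags that `SFTConfig` inherits from `transformers.TrainingArguments`:

```python
from trl import SFTConfig

# bf16 mixed precision on supported hardware; fall back to fp16=True otherwise.
training_args = SFTConfig(..., bf16=True)
```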