docs/source/reducing_memory_usage.md (13 additions, 2 deletions)

PEFT can be combined with other memory reduction techniques such as quantization (4-bit or 8-bit) for even greater memory savings. See [PEFT Integration](peft_integration) for quantization examples.
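
As an illustration, here is a minimal sketch of that combination (the model name, dataset, and LoRA hyperparameters are placeholders, not recommendations):

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Load the base model with 4-bit weights; the quantized weights stay frozen
# and only the small LoRA adapter is trained.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",  # placeholder model
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="qlora-sft"),
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),  # placeholder dataset
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```
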
## Liger for reducing peak memory usage
[Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%.
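
Enabling it in TRL is a single configuration flag; a minimal sketch, assuming a recent TRL/Transformers version that exposes `use_liger_kernel` on the trainer config:

```python
from trl import SFTConfig

# Swaps supported model layers for Liger's fused Triton kernels
# (RMSNorm, RoPE, SwiGLU, fused cross-entropy) when the model is loaded.
training_args = SFTConfig(output_dir="my-model", use_liger_kernel=True)
```
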
For more information, see [Liger Kernel Integration](liger_kernel_integration).

Offloading the vLLM weights and cache helps keep GPU memory usage low, which can be particularly beneficial when training large models or using limited GPU resources. However, waking the vLLM engine from sleep mode introduces some host–device transfer latency, which may slightly impact training speed.
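
As a rough sketch of how this is typically configured in online trainers such as GRPO (the `vllm_enable_sleep_mode` flag is an assumption here; the exact option name and its availability depend on your TRL version):

```python
from trl import GRPOConfig

# Assumption: a colocated vLLM engine with sleep mode, so weights and KV cache
# are offloaded during the optimization step and reloaded before generation.
training_args = GRPOConfig(
    output_dir="my-model",
    use_vllm=True,
    vllm_mode="colocate",         # run vLLM inside the training process
    vllm_enable_sleep_mode=True,  # assumed flag name; check your TRL version
)
```
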
## Gradient checkpointing
Gradient checkpointing trades compute for memory by not storing all intermediate activations during the forward pass, recomputing them during the backward pass instead.
Gradient checkpointing is available and activated by default across all TRL trainers. For more memory optimization techniques, see the [Transformers Performance Guide](https://huggingface.co/docs/transformers/perf_train_gpu_one#gradient-checkpointing).
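
If you need to toggle it explicitly, it goes through the standard Transformers training arguments; a minimal sketch:

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="my-model",
    gradient_checkpointing=True,
    # Non-reentrant checkpointing is generally preferred with recent PyTorch.
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```
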

docs/source/speeding_up_training.md (20 additions, 44 deletions)

## Optimized attention implementations
TRL supports various optimized attention implementations that can significantly speed up training while reducing memory usage. You can either pull pre-optimized kernels directly from the [Kernels Hub](kernels_hub) or use a manually built attention backend.
<hfoptions id="attention examples">
<hfoption id="Flash Attention 2">
To enable Flash Attention 2, pass `attn_implementation="flash_attention_2"` in the model initialization arguments:
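
A minimal sketch (the model name is a placeholder, the `flash-attn` package must be installed, and Flash Attention 2 expects a half-precision dtype):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",  # placeholder model
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```
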
Other options include `kernels-community/vllm-flash-attn3` and `kernels-community/paged-attention`.
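
Any of these Hub kernels can be requested the same way; a sketch, assuming a recent Transformers release that resolves `attn_implementation` strings against the Kernels Hub (requires the `kernels` package):

```python
import torch
from transformers import AutoModelForCausalLM

# The attention kernel is downloaded from the Kernels Hub at load time,
# so no local compilation is required.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",  # placeholder model
    attn_implementation="kernels-community/vllm-flash-attn3",
    torch_dtype=torch.bfloat16,
)
```
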
Optimized attention works across all TRL trainers. For more details, see [Kernels Hub Integration](kernels_hub).
</hfoption>
<hfoption id="Manual build">
> [!WARNING]
> Manually building optimized attention backends is complex and time-consuming, and it's not recommended unless absolutely necessary. Consider using kernels from the Kernels Hub instead, as described in the previous section.
If you have manually installed an optimized attention backend like Flash Attention 2, you can specify it in the training arguments:
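
For example, when the trainer instantiates the model from a model id, the attention backend can be forwarded through `model_init_kwargs` (a sketch; the exact argument surface may vary by trainer and version):

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="my-model",
    # Forwarded to AutoModelForCausalLM.from_pretrained by the trainer.
    model_init_kwargs={"attn_implementation": "flash_attention_2"},
)
```
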
</hfoption>
</hfoptions>
For more information, see [Liger Kernel Integration](liger_kernel_integration).