docs/source/distributing_training.md (+137 lines changed: 137 additions & 0 deletions)
@@ -55,6 +55,143 @@ Having one model per GPU can lead to high memory usage, which may not be feasibl
</Tip>
## Context Parallelism
Context Parallelism (CP) is a parallelization technique that enables training with longer sequences by splitting the sequence dimension across multiple GPUs. Each GPU processes a portion of the sequence, allowing you to train on sequences longer than would fit in a single GPU's memory.
For more details on CP, see the [Ultrascale Playbook - Context Parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism).
CP is particularly useful when:
- You want to train with very long sequences (>32k tokens)
- Single GPU memory is insufficient for your desired sequence length
- You need to maintain sequence coherence across the full context
### Requirements and Limitations
CP has specific requirements:
1. **Accelerate 1.10 or higher** is required
2. **FSDP2 (PyTorch FSDP v2)** is required as the distributed training backend
3. **SDPA attention** - Flash Attention is currently not supported with CP
4. **Sequence length divisibility** - sequences must be divisible by `cp_size * 2`. This is now automatically handled using the `pad_to_multiple_of` parameter in the data collator, which works seamlessly with both standard and padding-free modes, as sketched below.
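As a quick illustration of the divisibility rule (the numbers below are arbitrary and only for illustration):

```python
# Sequences must be padded to a multiple of cp_size * 2, which is exactly
# what pad_to_multiple_of in the data collator takes care of.
cp_size = 2                  # GPUs splitting the sequence dimension
multiple = cp_size * 2       # -> 4; use this value for pad_to_multiple_of
seq_len = 1021               # example raw sequence length
padded_len = ((seq_len + multiple - 1) // multiple) * multiple
print(multiple, padded_len)  # 4 1024
```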
### Configuration
To enable CP, you need to configure both Accelerate and your training arguments:
#### Accelerate Configuration
Use one of the provided accelerate config files (e.g. [`context_parallel_2gpu.yaml`](https://github.com/huggingface/trl/blob/main/examples/accelerate_configs/context_parallel_2gpu.yaml) for 2 GPUs):
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true # Enable activation checkpointing for memory efficiency
  # ... (see the linked config file for the remaining FSDP and context-parallelism settings)
```
#### Training Arguments

Then configure your training arguments to take advantage of CP (shown here with `SFTConfig`; the elided arguments are your usual training settings):

```python
training_args = SFTConfig(
    pad_to_multiple_of=4,  # ensures divisibility by cp_size * 2
    # to get the most out of CP
    max_length=16384,  # long sequence length
    packing=True,  # use packing to reduce padding
    use_liger_kernel=True,  # compatible with CP
    gradient_checkpointing=False,  # activation_checkpointing in the FSDP config and gradient_checkpointing here can't both be enabled
    per_device_train_batch_size=1,
    ...
)
```
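For context, here is a minimal sketch of how such a configuration could be wired into a training script; the model id, dataset, and output directory below are placeholders chosen for illustration, not values from this guide:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset and model, for illustration only.
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="sft-context-parallel",  # hypothetical output directory
    max_length=16384,
    packing=True,
    pad_to_multiple_of=4,               # cp_size * 2 for cp_size=2
    per_device_train_batch_size=1,
    gradient_checkpointing=False,       # activation checkpointing is handled via the FSDP config
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",          # placeholder model id
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

When launched with the accelerate config above, FSDP2 shards the model while the sequence dimension is split across the configured context-parallel group.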
Then, launch your training script with the appropriate accelerate config file, for example `accelerate launch --config_file examples/accelerate_configs/context_parallel_2gpu.yaml trl/scripts/sft.py` plus your usual script arguments.

For the benchmark results below, we adjusted `num_processes` and `parallelism_config_cp_size` based on the number of GPUs for each run.
Training was performed with the [sft.py](https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py) example script, combined with the parameters described above.
The results below summarize the **maximum trainable sequence length** and **iterations per second** for different numbers of GPUs. A value marked as `OOM` indicates that the configuration ran out of memory and could not be trained.
These results show that **Context Parallelism (CP) scales effectively with more GPUs**, enabling training on much longer sequences. With **8 GPUs**, context lengths of over **300k tokens** become feasible, unlocking training with extremely long contexts while maintaining reasonable throughput.
<div class="flex justify-center">
  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/context_parallelism_max_length_plot.png" alt="CP max context length" width="45%"/>
</div>

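As a rough intuition for this scaling (illustrative arithmetic only, not figures from the benchmark): each GPU holds roughly `seq_len / cp_size` of the sequence, so the trainable context grows close to linearly with the context-parallel size.

```python
single_gpu_max = 40_000        # illustrative single-GPU sequence limit, not a measured value
for cp_size in (1, 2, 4, 8):
    approx_max = single_gpu_max * cp_size   # ignores communication and activation overhead
    print(cp_size, approx_max)              # cp_size=8 -> 320_000 tokens
```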
Accelerate also supports **N-Dimensional Parallelism (ND-parallelism)**, which enables you to combine different parallelization strategies to efficiently distribute model training across multiple GPUs.
You can learn more and explore configuration examples in the [Accelerate ND-parallelism guide](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#nd-parallelism).
Additional resources on long-context and sequence-parallel training:

- [Hugging Face Blog: Enabling Long-Context Training with Sequence Parallelism in Axolotl](https://huggingface.co/blog/axolotl-ai-co/long-context-with-sequence-parallelism-in-axolotl)
- [Snowflake Engineering Blog: Arctic Long Sequence Training (ALST) — Scalable and Efficient Training for Multi-Million Token Sequences](https://www.snowflake.com/en/engineering-blog/arctic-long-sequence-training-multi-million-token-ai/) (note that this uses a different strategy)
## Multi-Node Training
We're working on a guide for multi-node training. Stay tuned! 🚀
docs/source/kernels_hub.md (+2 -2 lines changed: 2 additions & 2 deletions)
@@ -61,8 +61,8 @@ Kernel-based implementations perform on par with custom-installed attention, and
<div class="flex justify-center">
- <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kernels_guide_latency.png" alt="Latency and Memory Usage" width="600"/>
- <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kernels_guide_peak_allocated_memory.png" alt="Latency and Memory Usage" width="600"/>
+ <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kernels_guide_latency.png" alt="Latency and Memory Usage" width="45%"/>
+ <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kernels_guide_peak_allocated_memory.png" alt="Latency and Memory Usage" width="45%"/>
</div>
## Flash Attention (Build-from-Source) vs. Hub Kernels
docs/source/reducing_memory_usage.md (+1 -102 lines changed: 1 addition & 102 deletions)
@@ -22,7 +22,7 @@ To reduce memory usage, it's important to truncate sequences to a reasonable len
DPO truncation is applied first to the prompt and to the completion via the `max_prompt_length` and `max_completion_length` parameters. The `max_length` parameter is then used to truncate the resulting sequence.
This adjustment prevents model weights from being gathered, avoiding OOM errors, but it may result in slower generation speeds.
## Context Parallelism
Context Parallelism (CP) is a parallelization technique that enables training with longer sequences by splitting the sequence dimension across multiple GPUs. Each GPU processes a portion of the sequence, allowing you to train with sequences longer than what would fit on a single GPU's memory.
For more details on CP, see the [Ultrascale Playbook - Context Parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism).
CP is particularly useful when:
- You want to train with very long sequences (>32k tokens)
- Single GPU memory is insufficient for your desired sequence length
- You need to maintain sequence coherence across the full context
### Requirements and Limitations
CP has specific requirements:
1. **Accelerate 1.10 or higher** is required
2. **FSDP2 (PyTorch FSDP v2)** is required as the distributed training backend
3. **SDPA attention** - Flash Attention is currently not supported with CP
4. **Sequence length divisibility** - sequences must be divisible by `cp_size * 2`. This is now automatically handled using the `pad_to_multiple_of` parameter in the data collator, which works seamlessly with both standard and padding-free modes.
### Configuration
To enable CP, you need to configure both Accelerate and your training arguments:
#### Accelerate Configuration
Use one of the provided accelerate config files (e.g. `fsdp_context_parallel_2gpu.yaml` for 2 GPUs):
1. **Use the `pad_to_multiple_of` parameter** - This is now the recommended way to ensure sequence length divisibility:
   - For `cp_size=2`: use `pad_to_multiple_of=4` (since `cp_size * 2 = 4`)
   - For `cp_size=4`: use `pad_to_multiple_of=8` (since `cp_size * 2 = 8`)
   - The data collator automatically pads sequences to the required multiple, ensuring compatibility with CP
2. **Use packing with padding** - The default BFD (Best Fit Decreasing) strategy works perfectly:
   - Preserves sequence boundaries and maintains training quality
   - Works seamlessly with both `padding_free=True` and standard padding modes
3. **Combine with other memory optimizations** like Liger kernels, bfloat16, and gradient checkpointing
4. **Start with smaller context parallel sizes** (2-4 GPUs) before scaling up
5. **Monitor memory usage** across all GPUs to ensure balanced workload
## vLLM sleep mode
When using vLLM as the generation backend, you can enable _sleep mode_ to offload vLLM parameters and cache to CPU RAM during the optimization step and reload them back to GPU VRAM when needed for weight synchronization and generation.