
Commit 4bd4acf

sergiopaniego, kashif, and qgallouedec authored
🏞️ Context Parallelism benchmark guide (#4075)
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
1 parent 8380869 commit 4bd4acf

File tree: 4 files changed, +143 -107 lines changed


docs/source/distributing_training.md

Lines changed: 137 additions & 0 deletions
@@ -55,6 +55,143 @@ Having one model per GPU can lead to high memory usage, which may not be feasibl
</Tip>

## Context Parallelism

Context Parallelism (CP) is a parallelization technique that enables training with longer sequences by splitting the sequence dimension across multiple GPUs. Each GPU processes a portion of the sequence, allowing you to train with sequences longer than would fit in a single GPU's memory.

For more details on CP, see the [Ultrascale Playbook - Context Parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism).
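To make the sharding idea concrete, here is a minimal, hypothetical sketch of the split along the sequence dimension. It is not the actual CP implementation (which also coordinates attention across shards and ranks); the `cp_size` value and tensor shapes are assumptions for illustration.

```python
import torch

cp_size = 2                                   # assumed context-parallel degree
batch = torch.randint(0, 32_000, (1, 4096))   # (batch, seq_len) of dummy token IDs
shards = torch.chunk(batch, cp_size, dim=1)   # each CP rank would process one shard
print([tuple(s.shape) for s in shards])       # [(1, 2048), (1, 2048)]
```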
CP is particularly useful when:

- You want to train with very long sequences (>32k tokens)
- Single GPU memory is insufficient for your desired sequence length
- You need to maintain sequence coherence across the full context
### Requirements and Limitations

CP has specific requirements:

1. **Accelerate 1.10 or higher** is required
2. **FSDP2 (PyTorch FSDP v2)** is required as the distributed training backend
3. **SDPA attention** - Flash Attention is currently not supported with CP
4. **Sequence length divisibility** - sequences must be divisible by `cp_size * 2`. This is now automatically handled using the `pad_to_multiple_of` parameter in the data collator, which works seamlessly with both standard and padding-free modes (see the sketch after this list).
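As a quick illustration of requirement 4 (a sketch only, not the TRL collator internals), the collator rounds each batch's sequence length up to the next multiple of `pad_to_multiple_of`, which keeps it divisible by `cp_size * 2`:

```python
cp_size = 2                        # assumed CP degree
pad_to_multiple_of = cp_size * 2   # the value to pass in SFTConfig

seq_len = 1023                     # hypothetical raw sequence length in a batch
padded_len = ((seq_len + pad_to_multiple_of - 1) // pad_to_multiple_of) * pad_to_multiple_of
assert padded_len % (cp_size * 2) == 0
print(seq_len, "->", padded_len)   # 1023 -> 1024
```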
### Configuration

To enable CP, you need to configure both Accelerate and your training arguments:

#### Accelerate Configuration

Use one of the provided accelerate config files (e.g. [`context_parallel_2gpu.yaml`](https://github.com/huggingface/trl/blob/main/examples/accelerate_configs/context_parallel_2gpu.yaml) for 2 GPUs):
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true # Enable activation checkpointing for memory efficiency
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2 # Number of GPUs
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 1
  parallelism_config_tp_size: 1
  parallelism_config_cp_size: 2 # Context parallel size
```
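A sanity check worth keeping in mind when you edit this file (stated here as an assumption about how the ND-parallel degrees compose, not something enforced by the config itself): `num_processes` should match the product of the parallelism sizes.

```python
# Hypothetical helper mirroring the config above; adjust the values to your own setup.
dp_replicate_size, dp_shard_size, tp_size, cp_size = 1, 1, 1, 2
num_processes = 2  # number of GPUs launched by accelerate

assert num_processes == dp_replicate_size * dp_shard_size * tp_size * cp_size
```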
#### Training Configuration

```python
from trl import SFTConfig

training_args = SFTConfig(
    # required
    pad_to_multiple_of=4,   # ensures divisibility by cp_size * 2
    # to get the most out of CP
    max_length=16384,       # long sequence length
    packing=True,           # use packing to reduce padding
    use_liger_kernel=True,  # compatible with CP
    gradient_checkpointing=False,  # fsdp_activation_checkpointing (accelerate config) and gradient_checkpointing can't both be enabled
    per_device_train_batch_size=1,
    ...
)
```

Then, launch your training script with the appropriate accelerate config file:

```bash
accelerate launch --config_file context_parallel_2gpu.yaml train.py
```
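For reference, `train.py` above is whatever SFT script you want to run; a minimal sketch of what it could contain is shown below. The dataset choice (`trl-lib/Capybara`) and `output_dir` are assumptions for illustration, not part of this commit.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # assumed example dataset

training_args = SFTConfig(
    output_dir="qwen3-8b-cp",      # assumed output path
    pad_to_multiple_of=4,          # required: cp_size * 2 for cp_size=2
    max_length=16384,
    packing=True,
    use_liger_kernel=True,
    gradient_checkpointing=False,  # activation checkpointing comes from the accelerate config
    per_device_train_batch_size=1,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```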
### Best Practices

1. **Use the `pad_to_multiple_of` parameter** - This is the recommended way to ensure sequence length divisibility:
   - For `cp_size=2`: use `pad_to_multiple_of=4` (since `cp_size * 2 = 4`)
   - For `cp_size=4`: use `pad_to_multiple_of=8` (since `cp_size * 2 = 8`)
   - The data collator automatically pads sequences to the required multiple, ensuring compatibility with CP

2. **Use packing with padding** - The default BFD (Best Fit Decreasing) packing strategy works well with CP:
   - It preserves sequence boundaries and maintains training quality
   - It works with both `padding_free=True` and standard padding modes

3. **Combine CP with other memory optimizations** such as Liger kernels, bfloat16, and activation checkpointing (enabled via the FSDP config rather than `gradient_checkpointing`)

4. **Start with smaller context parallel sizes** (2-4 GPUs) before scaling up

5. **Monitor memory usage** across all GPUs to ensure a balanced workload (see the sketch below)
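For point 5, one simple way to check that the workload is balanced (a sketch, not a TRL utility) is to print each rank's peak allocated GPU memory, for example at the end of training:

```python
import torch
import torch.distributed as dist

def log_peak_memory(tag: str = "after training") -> None:
    """Print this rank's peak allocated GPU memory; call it on every rank."""
    if not (dist.is_available() and dist.is_initialized()):
        return
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[rank {dist.get_rank()}] {tag}: peak allocated {peak_gib:.2f} GiB")
```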
### Benchmarking Context Parallelism

We benchmarked CP to highlight its potential improvements in training efficiency.
Our experiments were conducted using **1, 2, 4, and 8 H100 GPUs**, though the results can be extended to larger clusters with more nodes and GPUs.

For the setup, we fine-tuned an **8B model** ([Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)) using the provided accelerate configuration
([`context_parallel_2gpu.yaml`](https://github.com/huggingface/trl/blob/main/examples/accelerate_configs/context_parallel_2gpu.yaml)).
We adjusted `num_processes` and `parallelism_config_cp_size` based on the number of GPUs for each run.
Training was performed with the [sft.py](https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py) example script, combined with the parameters described above.
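The sweep script itself is not part of this commit; a rough sketch of how such runs could be automated is shown below. The file paths, the string replacements on the config, and the CLI arguments passed to `sft.py` are assumptions for illustration.

```python
import subprocess
from pathlib import Path

base_cfg = Path("examples/accelerate_configs/context_parallel_2gpu.yaml").read_text()

for num_gpus in (1, 2, 4, 8):
    # Write a per-run copy of the config with the GPU count and CP size adjusted.
    cfg = base_cfg.replace("num_processes: 2", f"num_processes: {num_gpus}")
    cfg = cfg.replace("parallelism_config_cp_size: 2", f"parallelism_config_cp_size: {num_gpus}")
    cfg_path = Path(f"context_parallel_{num_gpus}gpu.yaml")
    cfg_path.write_text(cfg)

    subprocess.run(
        ["accelerate", "launch", "--config_file", str(cfg_path),
         "trl/scripts/sft.py", "--model_name_or_path", "Qwen/Qwen3-8B"],
        check=True,
    )
```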
The results below summarize the **maximum trainable sequence length** and **iterations per second** for different numbers of GPUs. A value marked as `OOM` indicates that the configuration ran out of memory and could not be trained.

These results show that **Context Parallelism (CP) scales effectively with more GPUs**, enabling training on much longer sequences. With **8 GPUs**, context lengths of over **300k tokens** become feasible, unlocking training with extremely long contexts while maintaining reasonable throughput.

<div class="flex justify-center">
  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/context_parallelism_max_length_plot.png" alt="CP max context length" width="45%"/>
  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/context_parallelism_s_it_plot.png" alt="CP seconds/iteration" width="45%"/>
</div>

<Tip>

Accelerate also supports **N-Dimensional Parallelism (ND-parallelism)**, which enables you to combine different parallelization strategies to efficiently distribute model training across multiple GPUs.

You can learn more and explore configuration examples in the [Accelerate ND-parallelism guide](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#nd-parallelism).

</Tip>
**Further Reading on Context Parallelism**

- [Accelerate: Context Parallelism Guide](https://github.com/huggingface/accelerate/blob/main/docs/source/concept_guides/context_parallelism.md)
- [Accelerate Example: 128k Sequence Length](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#context-parallelism-128k-sequence-length)
- [Hugging Face Blog: Enabling Long-Context Training with Sequence Parallelism in Axolotl](https://huggingface.co/blog/axolotl-ai-co/long-context-with-sequence-parallelism-in-axolotl)
- [Snowflake Engineering Blog: Arctic Long Sequence Training (ALST) — Scalable and Efficient Training for Multi-Million Token Sequences (note that they use a different strategy)](https://www.snowflake.com/en/engineering-blog/arctic-long-sequence-training-multi-million-token-ai/)

## Multi-Node Training

We're working on a guide for multi-node training. Stay tuned! 🚀

docs/source/kernels_hub.md

Lines changed: 2 additions & 2 deletions
@@ -61,8 +61,8 @@ Kernel-based implementations perform on par with custom-installed attention, and
<div class="flex justify-center">
-  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kernels_guide_latency.png" alt="Latency and Memory Usage" width="600"/>
-  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kernels_guide_peak_allocated_memory.png" alt="Latency and Memory Usage" width="600"/>
+  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kernels_guide_latency.png" alt="Latency and Memory Usage" width="45%"/>
+  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/kernels_guide_peak_allocated_memory.png" alt="Latency and Memory Usage" width="45%"/>
</div>

## Flash Attention (Build-from-Source) vs. Hub Kernels

docs/source/reducing_memory_usage.md

Lines changed: 1 addition & 102 deletions
@@ -22,7 +22,7 @@ To reduce memory usage, it's important to truncate sequences to a reasonable len
DPO truncation is applied first to the prompt and to the completion via the `max_prompt_length` and `max_completion_length` parameters. The `max_length` parameter is then used to truncate the resulting sequence.

<div class="flex justify-center">
-  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/truncation_prompt_completion.png" alt="Truncation prompt-completion" width="600"/>
+  <img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/truncation_prompt_completion.png" alt="DPO truncation" width="600"/>
</div>

To set the truncation parameters, use the following code snippet:
@@ -262,107 +262,6 @@ training_args = RLOOConfig(..., ds3_gather_for_generation=False)

This adjustment prevents model weights from being gathered, avoiding OOM errors, but it may result in slower generation speeds.

## Context Parallelism

Context Parallelism (CP) is a parallelization technique that enables training with longer sequences by splitting the sequence dimension across multiple GPUs. Each GPU processes a portion of the sequence, allowing you to train with sequences longer than what would fit on a single GPU's memory.

For more details on CP, see the [Ultrascale Playbook - Context Parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism).

CP is particularly useful when:

- You want to train with very long sequences (>32k tokens)
- Single GPU memory is insufficient for your desired sequence length
- You need to maintain sequence coherence across the full context

### Requirements and Limitations

CP has specific requirements:

1. **Accelerate 1.10 or higher** is required
2. **FSDP2 (PyTorch FSDP v2)** is required as the distributed training backend
3. **SDPA attention** - Flash Attention is currently not supported with CP
4. **Sequence length divisibility** - sequences must be divisible by `cp_size * 2`. This is now automatically handled using the `pad_to_multiple_of` parameter in the data collator, which works seamlessly with both standard and padding-free modes.

### Configuration

To enable CP, you need to configure both Accelerate and your training arguments:

#### Accelerate Configuration

Use one of the provided accelerate config files (e.g. `fsdp_context_parallel_2gpu.yaml` for 2 GPUs):

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_reshard_after_forward: true
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_version: 2
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2 # Number of GPUs
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 1
  parallelism_config_tp_size: 1
  parallelism_config_cp_size: 2 # Context parallel size
```

#### Training Configuration

```python
from trl import SFTConfig

training_args = SFTConfig(
    # required
    pad_to_multiple_of=4,   # ensures divisibility by cp_size * 2
    # to get the most out of CP
    max_length=16384,       # long sequence length
    packing=True,           # use packing to reduce padding
    use_liger_kernel=True,  # compatible with CP
    per_device_train_batch_size=1,
    ...
)
```

Then, launch your training script with the appropriate accelerate config file:

```bash
accelerate launch --config_file fsdp_context_parallel_2gpu.yaml train.py
```

### Best Practices

1. **Use the `pad_to_multiple_of` parameter** - This is now the recommended way to ensure sequence length divisibility:
   - For `cp_size=2`: use `pad_to_multiple_of=4` (since `cp_size * 2 = 4`)
   - For `cp_size=4`: use `pad_to_multiple_of=8` (since `cp_size * 2 = 8`)
   - The data collator automatically pads sequences to the required multiple, ensuring compatibility with CP

2. **Use packing with padding** - The default BFD (Best Fit Decreasing) strategy works perfectly:
   - Preserves sequence boundaries and maintains training quality
   - Works seamlessly with both `padding_free=True` and standard padding modes

3. **Combine with other memory optimizations** like Liger kernels, bfloat16, and gradient checkpointing

4. **Start with smaller context parallel sizes** (2-4 GPUs) before scaling up

5. **Monitor memory usage** across all GPUs to ensure balanced workload

## vLLM sleep mode

When using vLLM as the generation backend, you can enable _sleep mode_ to offload vLLM parameters and cache to CPU RAM during the optimization step and reload them back to GPU VRAM when needed for weight synchronization and generation.

examples/accelerate_configs/context_parallel_2gpu.yaml

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
-  fsdp_activation_checkpointing: false
+  fsdp_activation_checkpointing: true # Enable activation checkpointing for memory efficiency
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
@@ -16,7 +16,7 @@ machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
- num_processes: 2
+ num_processes: 2 # Number of GPUs
rdzv_backend: static
same_network: true
tpu_env: []
@@ -27,4 +27,4 @@ parallelism_config:
  parallelism_config_dp_replicate_size: 1
  parallelism_config_dp_shard_size: 1
  parallelism_config_tp_size: 1
-  parallelism_config_cp_size: 2
+  parallelism_config_cp_size: 2 # Context parallel size
