Add layer-wise activation recomputation to llama model #207
When benchmarking Nanotron with big models (e.g. Llama2-70b), we found that Nanotron consumes more memory than Megatron under the same parallel configuration. Besides the fix in #203, this PR adds layer-wise activation recomputation (also known as gradient checkpointing) to mitigate the issue. It is currently controlled by the `recompute_layer` flag in `src/nanotron/config/parallelism_config.py`.

Megatron also offers more fine-grained control over activation recomputation, but applying that to Nanotron is not straightforward, mainly because of the `PipelineBlock` abstraction. It would be great if the developers could shed some light on this; I would be glad to help formalize this PR.
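For context, here is a minimal sketch of what layer-wise recomputation looks like in plain PyTorch, assuming the `recompute_layer` flag is simply threaded down to each decoder layer. The `DecoderLayer` module and its `_forward` helper below are hypothetical stand-ins for illustration, not Nanotron's actual classes:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class DecoderLayer(nn.Module):
    """Stand-in for a Llama decoder layer (hypothetical, for illustration)."""

    def __init__(self, hidden_size: int, recompute_layer: bool = False):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.recompute_layer = recompute_layer

    def _forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.mlp(hidden_states)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.recompute_layer and self.training:
            # Drop intermediate activations now; recompute them during backward.
            return checkpoint(self._forward, hidden_states, use_reentrant=False)
        return self._forward(hidden_states)


if __name__ == "__main__":
    layer = DecoderLayer(hidden_size=64, recompute_layer=True)
    x = torch.randn(2, 8, 64, requires_grad=True)
    layer(x).sum().backward()  # gradients flow through the recomputed forward
```

The trade-off is the usual one for gradient checkpointing: activation memory per layer drops roughly to the layer's inputs, at the cost of one extra forward pass per layer during backward.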