Add layer-wise activation recomputation to llama model #207

Merged
2 commits merged into huggingface:main on Jul 14, 2024

Conversation

@C-TC (Contributor) commented on Jul 8, 2024

When benchmarking Nanotron with large models (e.g. Llama2-70B), we found that Nanotron consumes more memory than Megatron under the same parallel configuration. In addition to the fix in #203, this PR adds layer-wise activation recomputation (also known as gradient checkpointing) to mitigate the issue. It is currently controlled by the `recompute_layer` flag in `src/nanotron/config/parallelism_config.py`.
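
For context, here is a minimal sketch of what layer-wise activation recomputation typically looks like in PyTorch. The `DecoderLayer` below is a hypothetical stand-in (not Nanotron's actual llama module), and only the `recompute_layer` switch is modeled after the flag described above; the real wiring through Nanotron's PipelineBlock abstraction may differ.

```python
# Sketch of layer-wise activation recomputation (gradient checkpointing).
# DecoderLayer is a hypothetical stand-in, not Nanotron's llama implementation.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class DecoderLayer(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def _core_forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual MLP block standing in for a real layer's attention + MLP.
        return hidden_states + self.mlp(hidden_states)

    def forward(
        self, hidden_states: torch.Tensor, recompute_layer: bool = False
    ) -> torch.Tensor:
        if recompute_layer and self.training:
            # Do not keep this layer's intermediate activations; recompute
            # them during the backward pass, trading compute for memory.
            return checkpoint(self._core_forward, hidden_states, use_reentrant=False)
        return self._core_forward(hidden_states)
```

With such a flag enabled for every decoder layer, peak activation memory drops to roughly the layer inputs plus a single layer's intermediates per micro-batch, at the cost of one extra forward pass per layer during backward.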

Megatron also offers more fine-grained control over activation recomputation, but applying that to Nanotron is not straightforward, mainly because of the PipelineBlock abstraction. It would be nice if the developers could shed some light on this; I would be glad to help formalize this PR.

@3outeille (Member) commented:

lgtm !

@3outeille merged commit 4c23ed0 into huggingface:main on Jul 14, 2024
3 checks passed