Add layer-wise activation recomputation to llama model #207
When benchmarking Nanotron with big models (e.g. Llama2-70b), we found that Nanotron consumes more memory than Megatron under the same parallel configuration. Besides the fix in #203, this PR adds layer-wise activation recomputation (also known as gradient checkpointing) to mitigate the issue. It is currently controlled by the `recompute_layer` flag in `src/nanotron/config/parallelism_config.py`.

Megatron also offers more fine-grained control over activation recomputation, but applying that to Nanotron is not straightforward, mainly because of the `PipelineBlock` abstraction. It would be great if the developers could shed some light on this; I would be glad to help formalize this PR.
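For context, here is a minimal sketch of what layer-wise recomputation looks like in plain PyTorch, assuming the `recompute_layer` flag is simply threaded down to each decoder layer. The `DecoderLayer` module and its `_forward` helper below are hypothetical stand-ins for illustration, not Nanotron's actual classes:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class DecoderLayer(nn.Module):
    """Stand-in for a Llama decoder layer (hypothetical, for illustration)."""

    def __init__(self, hidden_size: int, recompute_layer: bool = False):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.recompute_layer = recompute_layer

    def _forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.mlp(hidden_states)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.recompute_layer and self.training:
            # Drop intermediate activations now; recompute them during backward.
            return checkpoint(self._forward, hidden_states, use_reentrant=False)
        return self._forward(hidden_states)


if __name__ == "__main__":
    layer = DecoderLayer(hidden_size=64, recompute_layer=True)
    x = torch.randn(2, 8, 64, requires_grad=True)
    layer(x).sum().backward()  # gradients flow through the recomputed forward
```

The trade-off is the usual one for gradient checkpointing: activation memory per layer drops roughly to the layer's inputs, at the cost of one extra forward pass per layer during backward.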