Do not wrap LoRA layers with FSDP
When wrapping the full Transformer `Block`, FSDP wraps both trainable
and non-trainable parameters. This results in substantially higher memory
consumption, negating the memory savings from LoRA.

By instead wrapping only `torch.nn.Linear` modules, we still make use of
FSDP but avoid wrapping the LoRA layers.
janEbert committed Jun 28, 2024
1 parent 7d0430a commit aa965cc
Showing 1 changed file with 1 addition and 1 deletion.

litgpt/finetune/lora.py

@@ -140,7 +140,7 @@ def setup(
                 " --quantize flag."
             )
         strategy = FSDPStrategy(
-            auto_wrap_policy={Block},
+            auto_wrap_policy={torch.nn.Linear},
             activation_checkpointing_policy={Block},
             state_dict_type="full",
             limit_all_gathers=True,
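For context, a minimal sketch of how this wrapping policy plugs into a Fabric launcher is shown below. It mirrors the arguments visible in the diff; the `litgpt.lora` import path for `Block` and the `Fabric(...)` arguments are assumptions for illustration, not the full LitGPT setup script.

```python
import torch
from lightning.fabric import Fabric
from lightning.fabric.strategies import FSDPStrategy
from litgpt.lora import Block  # assumed import path for the LoRA-aware transformer block

strategy = FSDPStrategy(
    # Shard only the Linear submodules, so the frozen base weights inside
    # each Block are not gathered as one large FSDP unit together with the
    # small trainable LoRA parameters.
    auto_wrap_policy={torch.nn.Linear},
    # Activation checkpointing can still be applied at Block granularity.
    activation_checkpointing_policy={Block},
    state_dict_type="full",
    limit_all_gathers=True,
)

fabric = Fabric(devices=2, strategy=strategy, precision="bf16-true")
fabric.launch()
```

Passing a set of module classes as `auto_wrap_policy` makes each matching submodule its own FSDP unit, which is how the LoRA layers themselves stay outside the wrapped units.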
