Do not wrap LoRA layers with FSDP
When wrapping the full Transformer `Block`, FSDP wraps both trainable
and non-trainable parameters. This results in substantially higher memory
consumption, negating the memory savings from LoRA.

By instead wrapping only `torch.nn.Linear` modules, we still make use of
FSDP but avoid wrapping the LoRA layers.
janEbert committed Jun 28, 2024
1 parent 7d0430a commit aa965cc
Showing 1 changed file with 1 addition and 1 deletion.

litgpt/finetune/lora.py

@@ -140,7 +140,7 @@ def setup(
                 " --quantize flag."
             )
         strategy = FSDPStrategy(
-            auto_wrap_policy={Block},
+            auto_wrap_policy={torch.nn.Linear},
             activation_checkpointing_policy={Block},
             state_dict_type="full",
             limit_all_gathers=True,
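For context, a minimal sketch of how this wrapping policy plugs into a Fabric launcher is shown below. It mirrors the arguments visible in the diff; the `litgpt.lora` import path for `Block` and the `Fabric(...)` arguments are assumptions for illustration, not the full LitGPT setup script.

```python
import torch
from lightning.fabric import Fabric
from lightning.fabric.strategies import FSDPStrategy
from litgpt.lora import Block  # assumed import path for the LoRA-aware transformer block

strategy = FSDPStrategy(
    # Shard only the Linear submodules, so the frozen base weights inside
    # each Block are not gathered as one large FSDP unit together with the
    # small trainable LoRA parameters.
    auto_wrap_policy={torch.nn.Linear},
    # Activation checkpointing can still be applied at Block granularity.
    activation_checkpointing_policy={Block},
    state_dict_type="full",
    limit_all_gathers=True,
)

fabric = Fabric(devices=2, strategy=strategy, precision="bf16-true")
fabric.launch()
```

Passing a set of module classes as `auto_wrap_policy` makes each matching submodule its own FSDP unit, which is how the LoRA layers themselves stay outside the wrapped units.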
