[FSDP2] full finetune: move state dict to cpu when cpu offloading #1495
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1495
Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit fedaa32 with merge base 5fcb931. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -584,4 +587,4 @@ def shard_model(
            fully_shard(m, **fsdp_kwargs)

    # Finally shard the entire model to account for any stragglers
    fully_shard(model)
This seems to be dropping **fsdp_kwargs by accident? It prevents CPU offloading for the root model.
Yeah I had meant to fix this, thanks for adding it here!
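For reference, a minimal sketch of the sharding sequence with the kwargs forwarded to the root call; the loop and the shard_condition predicate are illustrative rather than the exact torchtune code:

from torch.distributed._composable.fsdp import fully_shard

# Shard the wrapped submodules first, forwarding the shared FSDP kwargs
# (e.g. the CPU offload policy) so every layer picks up CPU offloading.
for m in reversed(list(model.modules())):
    if shard_condition(m):  # hypothetical predicate for which modules to wrap
        fully_shard(m, **fsdp_kwargs)

# Finally shard the entire model to account for any stragglers; passing
# **fsdp_kwargs here is what enables CPU offloading for the root parameters.
fully_shard(model, **fsdp_kwargs)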
@@ -338,6 +339,8 @@ def load_from_full_model_state_dict(
            sharded_meta_param.device_mesh,
            sharded_meta_param.placements,
        )
        if cpu_offload:
            sharded_tensor = sharded_tensor.cpu()
sharded_tensor has device=cuda because distribute_tensor/DTensor requires NCCL. For CPU offloading, we can move the DTensor to device=cpu afterwards to avoid a memory peak.
This torchtune-side fix alone can resolve the problem, because we move the state dict to CPU when CPU offloading; the PyTorch-side fix covers cases where the state dict is on GPU.
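In code, the load path shown in the diff above looks roughly like the following (a sketch assuming distribute_tensor is imported from torch.distributed._tensor and a cpu_offload flag is threaded in from the recipe):

from torch.distributed._tensor import distribute_tensor

# distribute_tensor shards full_tensor over the (NCCL-backed) device mesh,
# so the resulting DTensor lives on GPU even when the final destination is CPU.
sharded_tensor = distribute_tensor(
    full_tensor,
    sharded_meta_param.device_mesh,
    sharded_meta_param.placements,
)
# Moving each sharded DTensor to CPU right away keeps the accumulated sharded
# state dict off the GPU, which is what lowers peak memory during loading.
if cpu_offload:
    sharded_tensor = sharded_tensor.cpu()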
Thank you @weifengpy for figuring this out and landing the fix!
Btw @weifengpy I assume for save_checkpoint we will need to do the inverse operation, right? Move from CPU back to current device prior to resharding?
you're right. I need to update the PR to cover save_checkpoint |
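A hypothetical sketch of that inverse step; the helper name and the exact hook point in save_checkpoint are assumptions, not the final PR code:

import torch

def _move_back_to_device(sharded_tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Inverse of the load-time offload: gathering/resharding over NCCL needs the
    # sharded DTensor back on the current CUDA device, so move it there first.
    if sharded_tensor.device.type == "cpu":
        sharded_tensor = sharded_tensor.to(device)
    return sharded_tensor

# e.g. called on each sharded parameter before gathering a full state dict
# inside save_checkpoint.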
@@ -338,6 +339,8 @@ def load_from_full_model_state_dict(
            sharded_meta_param.device_mesh,
            sharded_meta_param.placements,
        )
        if cpu_offload:
            sharded_tensor = sharded_tensor.cpu()
This change makes sense to me. I am not sure if we should support the user trying to load a GPU state dict into an FSDP module that has CPU offloading enabled. We can provide a better error message though.
Got it. This reminds me of my overdue BE task to check for a CPU device in lazy_init when CPU offloading is enabled.
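A minimal sketch of the kind of check being discussed; variable names follow the diff above and the exact message is illustrative:

# Fail fast with a clear message instead of silently mixing devices when a GPU
# state dict is loaded into an FSDP module that has CPU offloading enabled.
if cpu_offload and full_tensor.device.type != "cpu":
    raise ValueError(
        "Expected the full state dict on CPU when cpu_offload=True, but got a "
        f"tensor on {full_tensor.device}. Load the checkpoint with "
        "map_location='cpu' before calling load_from_full_model_state_dict."
    )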
Context
What is the purpose of this PR? It is to resolve #1412.
full finetune with cpu offload:
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=True
full finetune without cpu offload:
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full fsdp_cpu_offload=False
lora:
tune run --nproc_per_node 2 lora_finetune_fsdp2 --config llama2/7B_lora
qlora:
tune run --nproc_per_node 2 lora_finetune_fsdp2 --config llama2/7B_qlora
When CPU offloading, we can move the state dict to CPU to avoid a memory peak. As the snapshots show, peak memory dropped from 7GB to 3GB:
[Memory snapshot: behavior before]
[Memory snapshot: behavior after]
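For context, a minimal sketch of how a CPU-offload flag is typically turned into FSDP2 sharding kwargs, assuming the CPUOffloadPolicy API in torch.distributed._composable.fsdp; the exact torchtune plumbing may differ:

from torch.distributed._composable.fsdp import CPUOffloadPolicy

def build_fsdp_kwargs(cpu_offload: bool) -> dict:
    # With CPUOffloadPolicy, FSDP2 keeps sharded parameters on CPU and only
    # moves them to GPU around forward/backward, trading memory for H2D traffic.
    kwargs = {}
    if cpu_offload:
        kwargs["offload_policy"] = CPUOffloadPolicy()
    return kwargs

# These kwargs are then passed to fully_shard(...) for both the submodules and
# the root model, as in shard_model above.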