enable QLoRA + FSDP2 #909
Conversation
use torchao copy_
enable saving checkpoint
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/909
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit cbb3da8 with merge base 71741df. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -17,26 +17,3 @@ def clone(func, *args, **kwargs):
        in precision.
        """
        return to_nf4(args[0][0].get_original_weight())
Starting from torchao 0.2.0, we implemented `NF4.copy_`. It's a superset of `inplace_copy`: we cover both `NF4.copy_(bf16)` and `NF4.copy_(NF4)`.
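A minimal sketch of the two paths, assuming torchao >= 0.2.0 with the `NF4Tensor.copy_` override described above (shapes are arbitrary):

```python
import torch
from torchao.dtypes.nf4tensor import to_nf4  # torchao >= 0.2.0 assumed

bf16_weight = torch.randn(512, 512, dtype=torch.bfloat16)
nf4_weight = to_nf4(bf16_weight)

# NF4.copy_(bf16): quantizes the bf16 source into the existing NF4 tensor in place
nf4_weight.copy_(torch.randn(512, 512, dtype=torch.bfloat16))

# NF4.copy_(NF4): copies already-quantized data between NF4 tensors
nf4_weight.copy_(to_nf4(torch.randn(512, 512, dtype=torch.bfloat16)))
```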
_component_: torch.nn.CrossEntropyLoss

fsdp:
  cpu_offload: False
Comparing with 7B_qlora_single_device.yaml, `cpu_offload` is the new feature.
Strictly speaking you probably do not need to put `cpu_offload` config under `fsdp` (unless you think it helps for clarity, or anticipate other kwargs that we'd wanna pass through to `fully_shard`). No strong preference either way here though.
I know Less has been working on `cpu_offload` for activations, which might become a root-level config. That will be a completely different thing than `cpu_offload` for parameters in FSDP2.

Another thing is that other `fully_shard` kwargs like `reshard_after_forward=False` can boost QPS.

Does it make sense to you?
Ah yeah this is a good point, makes sense to me. Thanks for clarifying
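For concreteness, a rough sketch of how these knobs could map onto FSDP2's `fully_shard` (assumes a PyTorch build exposing `torch.distributed._composable.fsdp`; the config dict layout and the per-layer loop are illustrative, not the recipe's actual API):

```python
from torch.distributed._composable.fsdp import (
    CPUOffloadPolicy,
    OffloadPolicy,
    fully_shard,
)

def shard_model(model, fsdp_cfg: dict):
    # cpu_offload toggles offloading sharded parameters/gradients to CPU.
    offload = CPUOffloadPolicy() if fsdp_cfg.get("cpu_offload", False) else OffloadPolicy()
    # reshard_after_forward=False keeps gathered params between forward and
    # backward, trading memory for throughput (the QPS boost mentioned above).
    for layer in model.layers:  # assumes a transformer-style list of blocks
        fully_shard(layer, offload_policy=offload, reshard_after_forward=False)
    fully_shard(model, offload_policy=offload, reshard_after_forward=False)
    return model
```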
# tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token <HF_TOKEN>
#
# To launch on a single device, run the following command from root:
# tune run lora_finetune_single_device --config llama2/7B_qlora_single_device
Will replace `lora_finetune_single_device` with `lora_finetune_fsdp2` before landing; leaving it for now to save CI and publish the PR earlier.
# Before loading the state dict, ensure the state dict keys for the base
# model and adapters (if available) match the keys in the full LoRA model
# This is a good sanity check to prevent silent errors
validate_state_dict_for_lora(
`validate_state_dict_for_lora` needs some refactoring since FSDP2 uses clean FQNs. FSDP2 also checks for params on the meta device and throws a RuntimeError if any are found (https://fburl.com/wrhxykn3):
"FSDP parameters should be materialized from meta device before training, "
f"but the following were still on meta device: {param_names_on_meta}\n"
"For example, call module.to_empty(device) to materialize to device and "
"call module.reset_parameters() on each module to initialize values."
Oh yeah this is what I was alluding to in the previous PR. You can try validate_missing_and_unexpected_for_lora as in the single-device recipe now that we have the correct FQN. If it doesn't work out of the box feel free to leave this as-is, we can come back to refactor after.
`validate_missing_and_unexpected_for_lora` works out of the box. Added it in this PR. Fingers crossed!
#
# Model Arguments
model:
Comparing with 70B_lora.yaml, this adds `output_proj` to the LoRA attention modules and sets `apply_lora_to_mlp: True`. The difference is similar to 7B_lora.yaml vs 7B_qlora_single_device.yaml.
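Expressed through the llama2 LoRA builder, the delta looks roughly like the sketch below (the builder name and argument names are assumptions based on the configs being compared; check the actual signatures):

```python
# Hedged sketch only: builder and argument names are assumed, not verified.
from torchtune.models.llama2 import lora_llama2_70b

model = lora_llama2_70b(
    lora_attn_modules=["q_proj", "k_proj", "v_proj", "output_proj"],  # output_proj added
    apply_lora_to_mlp=True,  # enabled here, unlike 70B_lora.yaml
    quantize_base=True,      # QLoRA: keep the frozen base weights in NF4
)
```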
@@ -0,0 +1,91 @@
# Config for single device QLoRA with lora_finetune_single_device.py |
Should we rename this config to 7B_qlora_fsdp2.yaml to match the LoRA one?
good point! updating now
    )
else:
    sharded_tensor = distribute_tensor(
        full_tensor,
No longer need to convert dtype as before?
It's moved to line 284: `full_tensor = full_tensor.to(sharded_meta_param.dtype).to(device)`. The extra `.to(device)` is for NF4, since quantization on CPU is prohibitively slow.
if isinstance(sharded_meta_param._local_tensor, NF4Tensor):
    full_tensor = to_nf4(full_tensor)
    # replicating logic from `_fsdp_param.py` `_init_sharded_param`
    # otherwise `distribute_tensor(DTensor(local=NF4))`
    # requires dispatching `c10d.scatter_`
    # long-term solution is `swap_tensor`
    mesh = sharded_meta_param.device_mesh
    if mesh.ndim > 1:
        raise NotImplementedError(f"only support 1D FSDP but got {mesh.ndim=}")
    shard_mesh_dim = 0
    shard_world_size = mesh.size(shard_mesh_dim)
    shard_rank = cast(
        torch.distributed.ProcessGroup, mesh.get_group(shard_mesh_dim)
    ).rank()
    chunk = list(torch.chunk(full_tensor, shard_world_size, dim=0))[shard_rank]
    sharded_param = full_tensor.new_zeros(chunk.size())
    sharded_param[: chunk.size(0)].copy_(chunk)
    sharded_tensor = DTensor(
        sharded_param,
        sharded_meta_param.device_mesh,
        sharded_meta_param.placements,
        shape=sharded_meta_param.size(),
        dtype=sharded_meta_param.dtype,
        requires_grad=sharded_meta_param.requires_grad,
        stride=sharded_meta_param.stride(),
    )
If it's not too much work, is it possible to extend the unit test you added in #855 to cover this case as well? As is I find it a bit hard to follow and want to make sure we have a reliable sanity check in case anything breaks in the future
You're right. Will try to come up with a unit test to cover NF4.
@ebsmothers I added pytest -s tests/torchtune/utils/test_distributed.py -k test_qlora_state_dict
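For reference, a minimal single-process sketch of the chunk-and-pad invariant the sharding code above relies on (world size and shapes are made up; the real test_qlora_state_dict exercises FSDP2 and NF4 end to end across ranks):

```python
import torch

def test_chunk_and_pad_roundtrip():
    world_size = 4
    full = torch.randn(18, 8)  # dim 0 not evenly divisible by world_size
    chunks = list(torch.chunk(full, world_size, dim=0))  # sizes 5, 5, 5, 3
    recovered = []
    for chunk in chunks:
        shard = full.new_zeros(chunks[0].size())  # pad every shard to the largest chunk
        shard[: chunk.size(0)].copy_(chunk)
        recovered.append(shard[: chunk.size(0)])
    # Concatenating the unpadded slices reconstructs the original full tensor.
    assert torch.equal(torch.cat(recovered, dim=0), full)
```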
This is looking good! As discussed offline we can separate out the checkpoint save support into a follow-up PR to unblock other ongoing work. Once CI is green this one should be good to land. Thank you for enabling a huge huge feature for our users!
Merged after confirming CI is all green. Will follow up on saving the state dict.
Co-authored-by: Kartikay Khandelwal <47255723+kartikayk@users.noreply.github.com>
Co-authored-by: ebsmothers <ebs@meta.com>
Co-authored-by: Rafi Ayub <33648637+RdoubleA@users.noreply.github.com>
Co-authored-by: Joe Cummings <jrcummings27@gmail.com>
Co-authored-by: Salman Mohammadi <salman.mohammadi@outlook.com>
Co-authored-by: Rohan Varma <rvarm1@fb.com>
Co-authored-by: Optimox <sebastien.fischman@gmail.com>
Co-authored-by: Tanish Ambulkar <tanish.ambulkar99@gmail.com>
Co-authored-by: Botao Chen <markchen1015@meta.com>
Co-authored-by: solitude-alive <44771751+solitude-alive@users.noreply.github.com>
Co-authored-by: christobill <christobill@users.noreply.github.com>
Co-authored-by: Philip Bontrager <pbontrager@gmail.com>
Co-authored-by: Evan Smothers <ebs@fb.com>
This PR is built on top of

Unit test:
pytest -s tests/torchtune/utils/test_distributed.py -k test_qlora_state_dict

Attaching snapshots of two runs:
tune run --nnodes 1 --nproc_per_node 8 lora_finetune_fsdp2 --config llama2/7B_qlora
tune run lora_finetune_single_device --config llama2/7B_qlora_single_device

Numerics: (snapshots attached)
tokens_per_second_per_gpu: 1st is A100, 2nd is H100; H100 scales better with faster memory read/write. (snapshots attached)
Memory: (snapshots attached)