add FSDP QLoRA test and revert failing PR #403
Conversation
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/403

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 unrelated failure) As of commit a9f6cca with merge base 6b0ca2d:

BROKEN TRUNK - The following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
@@ -11,10 +11,6 @@
 from torch import Tensor
 from torch.distributed.device_mesh import DeviceMesh
 from torch._prims_common import make_contiguous_strides_for
-from torchao.dtypes.utils import (
```
#360 consolidated `_implements` and `_ATEN_OP_OR_TORCH_FN_TABLE`, but it breaks torchtune. Reverting for now to unblock torchtune quickly.
Do you know how exactly this breaks torchtune? Is it a versioning thing between saved models and this new model?
The error is `TypeError: nf4_detach() missing 1 required positional argument: 'args'`, so something is incompatible around `_ATEN_OP_OR_TORCH_FN_TABLE[func](*args, **kwargs)`. The error shows up when people start training in torchtune for the first time.
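This error pattern is consistent with a calling-convention mismatch: a handler written as `fn(aten_op, args, kwargs=None)` being invoked as `fn(*args, **kwargs)`, so the first tensor argument binds to the `aten_op` parameter and `args` goes unfilled. A minimal standalone reproduction of that failure mode (hypothetical names, not torchao's actual code):

```python
# Sketch of the signature mismatch that can produce
# "TypeError: nf4_detach() missing 1 required positional argument: 'args'".

def nf4_detach(aten_op, args, kwargs=None):
    # Old-style handler: expects the op plus a tuple of the op's arguments.
    return args[0]

_TABLE = {"aten.detach.default": nf4_detach}

def dispatch(func, *args, **kwargs):
    # New-style call site splats the op's own arguments, so the tensor
    # binds to `aten_op` and the `args` parameter is left missing.
    return _TABLE[func](*args, **kwargs)

def reproduce():
    try:
        dispatch("aten.detach.default", "fake_tensor")
    except TypeError as e:
        return str(e)
    return "no error"
```

If this is the cause, the fix is either to update every registered handler to the new convention or (as this PR does) revert the call-site change.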
@jerryzh168 any thoughts on why this is happening? Otherwise, are you okay to undo your changes?
@drisspg to add a bit more to what @weifengpy already said, the full stack trace is here.
```python
def test_qlora_fsdp2(self):
    from torch.distributed._composable.fsdp import CPUOffloadPolicy, OffloadPolicy

    self.run_subtests(
```
An e2e multi-GPU FSDP + QLoRA test should be able to catch regressions like this in the future.
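The `run_subtests` call in the test above follows a common PyTorch distributed-test pattern: run one test body over the cartesian product of configuration options, so a single e2e test covers many FSDP configs. A minimal standalone sketch of that pattern (not PyTorch's actual implementation; the config names are illustrative):

```python
import itertools

def run_subtests(subtest_config, test_fn, *args, **kwargs):
    """Run test_fn once per combination of the config option lists."""
    keys = list(subtest_config.keys())
    results = []
    for values in itertools.product(*subtest_config.values()):
        cfg = dict(zip(keys, values))
        results.append(test_fn(*args, **cfg, **kwargs))
    return results

def fake_qlora_test(enable_activation_checkpointing, offload_to_cpu):
    # Stand-in for the real FSDP + QLoRA body, which would build the model,
    # shard it with FSDP2, train a few steps, and compare against a reference.
    return (enable_activation_checkpointing, offload_to_cpu)
```

Sweeping options like CPU offload in one test is what lets it catch convention breaks (such as the `nf4_detach` one) across configurations.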
Commits:
* add FSDP QLoRA test and revert failing PR
* check pytorch version and cuda for ci
* revert linter
Fixes the error when running torchtune QLoRA + FSDP2 (#380):

```
TypeError: nf4_detach() missing 1 required positional argument: 'args'
```

torchtune command

Test plan:
* `pytest -s test/dtypes/test_nf4.py -k test_qlora`: e2e fsdp2 + qlora test
* `pytest -s test/dtypes/test_nf4.py -k test_tensor_copy`: torchtune implemented `NF4.clone`; upstream it to TorchAO. This is needed by the unit test's `copy.deepcopy(model)`.
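The `copy.deepcopy(model)` dependency on `clone` can be illustrated with a plain Python stand-in: `deepcopy` on a tensor-like wrapper is typically routed through `__deepcopy__`, which delegates to `clone()` to duplicate the underlying storage. This is a hypothetical sketch, not the actual `NF4Tensor` implementation:

```python
import copy

class FakeNF4Tensor:
    """Toy stand-in for a quantized tensor subclass."""

    def __init__(self, data):
        self.data = list(data)  # stand-in for the quantized storage

    def clone(self):
        # Duplicate the storage so the copy does not alias the original.
        return FakeNF4Tensor(self.data)

    def __deepcopy__(self, memo):
        # deepcopy of a model holding this tensor lands here; without a
        # working clone(), the copy would fail or share storage.
        return self.clone()
```

`test_tensor_copy` guards exactly this path: mutating a deep copy must leave the original untouched.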