
Fix precision + QLoRA state dict tests, DTensor init #1087

Merged: 8 commits, Jun 18, 2024

Conversation

Contributor

@ebsmothers commented Jun 13, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

After #855 added usage of FSDPTest, we cannot set torch.backends.cudnn flags in the same test env as an FSDPTest. The fix is to set torch.backends.__allow_nonbracketed_mutation_flag to True inside the test (sketched below), similar to what was done in #855. Unfortunately we missed this test case because our unit tests that require GPUs do not currently run in CI.
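
A minimal sketch of that workaround, pieced together from the diff further down; the surrounding test code is illustrative, not the exact change. Running an FSDPTest in the same environment appears to trigger torch.backends.disable_global_flags() (per the error message in the test plan), after which direct mutation of torch.backends.cudnn raises.

import torch

# After disable_global_flags() has run, directly mutating torch.backends.cudnn
# flags raises a RuntimeError. Re-enabling the private mutation flag around the
# mutation sidesteps that, and we restore it to False afterwards.
setattr(  # noqa: B010
    torch.backends, "__allow_nonbracketed_mutation_flag", True
)
torch.backends.cudnn.allow_tf32 = False  # now permitted inside the test
setattr(  # noqa: B010
    torch.backends, "__allow_nonbracketed_mutation_flag", False
)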

Also, test_qlora_state_dict is failing because it requires a newer version of torchao. The current torchao release does not have __version__ defined, so we add the wonderful check "__version__" not in dir(torchao). Separately, there is an issue where the DTensor API changed in a BC-breaking way. I bisected to the appropriate nightly, and for now I gate based on that torch version (see the sketch below). Once 2.4 releases, we should be able to remove this check.
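
Roughly, the two gates described above look like this; the skip message and structure are illustrative rather than the literal test code, and the nightly version string comes from the diff further down.

import pytest
import torch
import torchao
from packaging import version

# torchao did not define __version__ at the time, so its mere presence is used
# as a proxy for a new-enough torchao.
if "__version__" not in dir(torchao):
    pytest.skip("test_qlora_state_dict requires a newer torchao", allow_module_level=True)

# The DTensor constructor changed in a BC-breaking way; gate on the bisected
# nightly until 2.4 is released.
_NEW_DTENSOR_API = version.parse(torch.__version__) >= version.parse("2.4.0.dev20240606")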

Oh, also: the way we were checking version("torchao") in _register_nf4_dispatch_ops.py doesn't really account for the possibility of having torchao-nightly installed (increasingly prevalent lately). So I updated that bit of code to account for a possible nightly torchao install (sketched below). Once torchao has a __version__ and we are on 0.3, all of this can go.
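
Something along these lines, assuming the two distribution names mentioned above; the helper is a sketch, not the literal code in _register_nf4_dispatch_ops.py.

from importlib.metadata import PackageNotFoundError
from importlib.metadata import version as pkg_version

def _installed_torchao_version() -> str:
    # version("torchao") alone misses the nightly build, which ships under a
    # separate distribution name.
    try:
        return pkg_version("torchao")
    except PackageNotFoundError:
        return pkg_version("torchao-nightly")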

Follow-ups:

  • Test GPU unit tests in our CI (there are quite a few of them, and right now we have no signal on these, many of which are for more bleeding-edge features)
  • Define proper version-based gating across our various dependencies and apply it consistently across the repo

Test plan

Prior to these changes:

pytest tests/
...
FAILED tests/torchtune/utils/test_distributed.py::TestFullyShardState::test_qlora_state_dict - RuntimeError: Process 0 exited with error code 10 and exception:
FAILED tests/torchtune/utils/test_precision.py::TestPrecisionUtils::test_set_float32_precision - RuntimeError: not allowed to set torch.backends.cudnn flags after disable_global_flags; please use flags() context manager instead
============== 2 failed, 242 passed, 34 skipped, 2 warnings in 55.96s ======================

After these changes (and on a nightly version of torchao):

pytest tests/
...
===== 244 passed, 34 skipped, 2 warnings in 63.99s (0:01:03) ========


pytorch-bot bot commented Jun 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1087

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 69a372d with merge base abe798d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Jun 13, 2024
Contributor

@pbontrager left a comment


Good catch, just one comment and then good to go.

tests/torchtune/utils/test_precision.py
_set_float32_precision("highest")
assert torch.get_float32_matmul_precision() == "highest"
assert not torch.backends.cudnn.allow_tf32
assert not torch.backends.cuda.matmul.allow_tf32

_set_float32_precision("high")
setattr(  # noqa: B010
    torch.backends, "__allow_nonbracketed_mutation_flag", True
)
Contributor


Is there a reason for adding setattr a second time, and for its placement here? Or is doing it once at the start enough?

Contributor Author


Sorry, the second one should be False. Will fix.

@ebsmothers changed the title from "Fix precision test" to "Fix precision test and DTensor init" Jun 13, 2024
@ebsmothers changed the title from "Fix precision test and DTensor init" to "Fix precision + QLoRA state dict tests, DTensor init" Jun 13, 2024
stride=sharded_meta_param.stride(),
)
# BC-breaking change to DTensor API in https://github.com/pytorch/pytorch/pull/128112
if version.parse(torch.__version__) >= version.parse("2.4.0.dev20240606"):
Contributor Author


@wanchaol @weifengpy please let me know if this is the best way to handle this; I'm also open to any other ideas either of you have here.

Contributor Author


Also regarding a longer-term plan here, are there any plans to make DTensor a public API? Or should we consider whether there's an alternative way to do this without directly calling into the DTensor API?

Contributor


Chatted about this. The long-term solution is to use the public DTensor API:

return DTensor.from_local(
    local_tensor,
    sharding_spec.mesh,
    sharding_spec.placements,
    shape=sharding_spec.shape,
    stride=sharding_spec.stride,
)


@wanchaol Jun 14, 2024


Since this is state_dict logic, I think we should just directly use DTensor.from_local? DTensor.__new__ is kind of a private API of DTensor.

We are going to make the DTensor API public, but probably not DTensor.__new__.

Contributor Author


One more question here: if using this with NF4, does it mean we will also have to add logic for aten.view in ao (based on this line)?

Contributor

@weifengpy Jun 15, 2024


Good catch, I think so. Do you know the args/kwargs after dispatching? If view_as(shape) matches NF4.shape, that's trivial. Otherwise we need a deeper look, because the semantics of NF4.view_as(a different shape) are not trivial.

Contributor


BTW, it does not have to be addressed in this PR if it needs a torchao change.

Contributor Author


Yeah, personally I don't know the args or kwargs, so I need to do a bit more investigation. Will leave this as a fast follow for now.

Contributor


I just found that PyTorch built from source does not work: torch.__version__ is 2.4.0a0+git..... My commit hash is before the DTensor BC break, but it still goes into the DTensor(spec) branch and complains. cc @ebsmothers

Contributor

@joecummings left a comment


🫡🫡🫡

@joecummings
Contributor

You can rebase on main now that #1095 has been merged, and the workflows should pass.

@ebsmothers ebsmothers merged commit 66d1a9c into pytorch:main Jun 18, 2024
29 checks passed
maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024
Labels
CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
7 participants