Add an option to use fp8-all-gather only without fp8 computation. #1093
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1093
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 5c5f4f2 with merge base ae77f40.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D63056142
CI error looks relevant. maybe take a look before landing?
Force-pushed 7857bb1 to 2950a2e
Force-pushed 2950a2e to 0f3ef23
Force-pushed 0f3ef23 to 5c5f4f2
#1451: for now, delete the float8-all-gather-only functionality from float8 training

Summary: In #1093 we added a config option, off by default, to use only float8 all-gather for training and do the matrix multiply in high precision. This seems generally useful for communication-bound workloads, but we can probably think of a cleaner way to add this functionality (such as a weight wrapper tensor subclass). The current implementation adds non-trivial complexity and doesn't jive well with where we want to take this codebase. Since no one is using this internally or externally yet and we haven't talked about it in the release notes, I think we should do a BC-breaking delete as a one-off. However, if people have concerns - let me know and we can talk about less aggressive options.

Test Plan:
```
./test/float8/test_everything.sh
```
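One way to read the "weight wrapper tensor subclass" suggestion above is a subclass that affects only communication: cast to float8 just before the FSDP all-gather and cast back right after, so the module's matmul never sees float8 and `Float8Linear` needs no extra branch. The sketch below is a hypothetical illustration, not torchao code; the `fsdp_pre_all_gather`/`fsdp_post_all_gather` hook signatures are simplified, and per-tensor scaling, autograd, and cross-rank amax synchronization are glossed over.

```python
import torch
from torch.utils._pytree import tree_map


class Float8AllGatherOnlyWeight(torch.Tensor):
    """Hypothetical wrapper: float8 on the wire, high precision for compute."""

    @staticmethod
    def __new__(cls, data: torch.Tensor):
        return torch.Tensor._make_wrapper_subclass(
            cls, data.shape, dtype=data.dtype, device=data.device,
            requires_grad=data.requires_grad,
        )

    def __init__(self, data: torch.Tensor):
        self._data = data  # the high-precision shard that FSDP owns

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Outside of communication, behave like the plain inner tensor.
        def unwrap(t):
            return t._data if isinstance(t, cls) else t
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs or {}))

    # --- FSDP2 extension hooks (signatures simplified for illustration) ---
    def fsdp_pre_all_gather(self, mesh):
        # Dynamic per-tensor scale; real code would sync amax across ranks.
        scale = self._data.abs().max().clamp(min=1e-12) / torch.finfo(torch.float8_e4m3fn).max
        fp8_shard = (self._data / scale).to(torch.float8_e4m3fn)
        return (fp8_shard,), (scale, self._data.dtype)

    def fsdp_post_all_gather(self, all_gather_outputs, metadata, param_dtype, *, out=None):
        (fp8_weight,), (scale, orig_dtype) = all_gather_outputs, metadata
        # Upcast immediately after the all-gather: the matmul then runs on a
        # plain high-precision tensor, i.e. no float8 compute anywhere.
        return fp8_weight.to(orig_dtype) * scale, (fp8_weight,)
```

Compared with the if-else branch that #1093 added inside `Float8Linear`, this would keep the all-gather-only behavior contained in the parameter's type, which is presumably the "cleaner way" the note above has in mind.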
Summary:
The implementation reuses the `WeightWithDynamicFloat8CastTensor` class and the `Float8Linear` module. I added an if-else branch in the existing `Float8Linear` module to reuse our existing logic for handling the different casting cases, such as pre-/post-forward for delayed scaling and pre-computing amax for fp8-all-gather.

Differential Revision: D63056142
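For orientation, here is a minimal sketch of what such an if-else branch could look like. It is not the torchao implementation: the flag name `use_fp8_all_gather_only` and the helper `_float8_forward` are hypothetical, and the scaling/autograd machinery of the real `Float8Linear` is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Float8LinearSketch(nn.Linear):
    """Illustrative only: with use_fp8_all_gather_only=True the weight is still
    wrapped so FSDP all-gathers it in float8, but the matmul itself runs in the
    original high precision instead of float8."""

    def __init__(self, in_features, out_features, bias=True,
                 use_fp8_all_gather_only=False):
        super().__init__(in_features, out_features, bias=bias)
        self.use_fp8_all_gather_only = use_fp8_all_gather_only  # hypothetical flag

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        if self.use_fp8_all_gather_only:
            # fp8 was only used on the wire; compute in the input's dtype.
            weight = self.weight.to(input.dtype)
            return F.linear(input, weight, self.bias)
        # Existing path: cast input/weight to float8 and use the scaled
        # float8 matmul (delayed-scaling bookkeeping elided).
        return self._float8_forward(input)

    def _float8_forward(self, input: torch.Tensor) -> torch.Tensor:
        # Placeholder for the real float8 compute path.
        raise NotImplementedError
```

Usage would then come down to flipping the flag when constructing the module, e.g. `Float8LinearSketch(1024, 4096, use_fp8_all_gather_only=True)`, with the rest of the training loop unchanged.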