Use checkpoint to enforce the recomputation of fp8 weight #936
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/936
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 40584c8 with merge base 9229df9.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D63345959
Summary:
Pull Request resolved: pytorch#936

The issue: when using float8 training with FSDP, we have these tensors in the forward/backward graph:
- Without fp8-all-gather: original_weight (all-gather output, sharded) - fp8_weight - fp8_weight_transpose (needed in backward)
- With fp8-all-gather: original_weight (sharded) - fp8_weight (all-gather output, sharded) - fp8_weight_transpose (needed in backward)

`torch.compile` decides how to partition the graph and which tensors to save for backward. Both with and without fp8-all-gather, it decides to save "fp8_weight_transpose" for backward. This is fine in the single-GPU case, where computing fp8_weight and fp8_weight_transpose in forward can be fused into one kernel. However, when we use FSDP to shard the weights, the weight itself is sharded but the "fp8_weight_transpose" tensors are not, so saving them for backward results in high memory utilization.

To fix it, we have different options:
- In the user code, enforce which tensors to save for backward. `save_for_backward` in a custom autograd.Function is one way to specify which tensors to save; however, torch.compile ignores what is manually saved for backward in a custom autograd.Function and just runs the partitioner.
- **[This PR]** Use `torch.utils.checkpoint`, which is the API that compile does promise to respect today. It instructs compile to only save the checkpoint's inputs for backward (the weight and activation), and not the intermediate values from the float8 cast (see the sketch below).
- Rely on torch.compile to find a partition that optimizes both computation and memory. This may be a longer-term solution to fix in compile.

Differential Revision: D63345959
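To illustrate the approach, here is a minimal sketch (not the code in this PR; the module and helper names are hypothetical): wrapping the weight cast in `torch.utils.checkpoint` means only the checkpoint's inputs are saved for backward, and the cast is recomputed during the backward pass.

```python
import torch
import torch.nn as nn
from torch.utils import checkpoint


class Fp8RecomputeLinear(nn.Module):
    """Hypothetical stand-in for the float8 linear touched by this PR."""

    def __init__(self, in_features, out_features, force_recompute_fp8_weight_in_bwd=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.force_recompute_fp8_weight_in_bwd = force_recompute_fp8_weight_in_bwd

    def cast_weight_for_matmul(self, weight):
        # Stand-in for the real float8 cast: the actual code quantizes the
        # weight to float8 and builds the transposed copy used by the matmul.
        return weight.t().contiguous()

    def forward(self, x):
        if self.force_recompute_fp8_weight_in_bwd:
            # checkpoint saves only its inputs (self.weight) for backward and
            # recomputes the cast there, so the torch.compile partitioner does
            # not keep the (unsharded) cast weight alive between forward and
            # backward.
            weight_t = checkpoint.checkpoint(
                self.cast_weight_for_matmul, self.weight, use_reentrant=False
            )
        else:
            weight_t = self.cast_weight_for_matmul(self.weight)
        return x @ weight_t


model = torch.compile(Fp8RecomputeLinear(16, 32))
y = model(torch.randn(4, 16))
y.sum().backward()
```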
thank you! left a couple of nit comments on the internal version
weight_scale = self.get_weight_scale(self.weight)

if self.force_recompute_fp8_weight_in_bwd:
    weight_fp8_t = checkpoint.checkpoint(
Will this `force_recompute_fp8_weight_in_bwd` flag help ensure that the checkpoint is only done in compile?
In eager, if someone has an outer AC context, the semantics are to do recursive checkpointing, i.e., a single tensor is recomputed multiple times. Nesting AC within SAC also hasn't been tested.
In the case of compile, the behavior is not really defined, but it might be possible that it would behave as you want, e.g. nested SAC policies where the inner policy overrides the outer policy, but I have not tested it.
We should also add a note here that even though this might work today, in the future we'd want to replace it with some nicer API.
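To make the nesting concern concrete, here is a minimal eager-mode sketch (hypothetical module, not code from this repo): an outer user-level checkpoint wraps a layer that internally checkpoints its weight cast, so during backward the outer recomputation re-runs the inner checkpoint and the cast is computed multiple times.

```python
import torch
import torch.nn as nn
from torch.utils import checkpoint


class InnerRecomputeLinear(nn.Module):
    # Hypothetical layer that checkpoints its own weight cast, as this PR does
    # for the fp8 cast.
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim))

    def cast(self, w):
        return w.t().contiguous()  # stand-in for the fp8 cast + transpose

    def forward(self, x):
        w_t = checkpoint.checkpoint(self.cast, self.weight, use_reentrant=False)
        return x @ w_t


block = InnerRecomputeLinear(8)
x = torch.randn(2, 8, requires_grad=True)

# Outer, user-level activation checkpointing around the whole block. In eager
# this nests checkpoints: during backward the outer region is recomputed, and
# inside that recomputation the inner checkpoint runs again, so the weight
# cast ends up being executed multiple times.
y = checkpoint.checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```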
Thanks for the comment.
The most common use case of float8 is with compile, because eager performance is very bad.
Maybe I can add a warning that if "force_recompute_fp8_weight_in_bwd" is enabled, it's recommended to rely on torch.compile for activation checkpointing; otherwise, if users want to use customized AC, they should be sure to handle the checkpointing of weights themselves (see the sketch below)?
I'll also add the note that even though this might work today, in the future we'd want to replace it with some nicer API. Shall we also open an issue to track the longer-term solution?
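For illustration, a minimal sketch of what such a warning could look like (the placement and wording are assumptions, not the code that landed):

```python
import warnings

force_recompute_fp8_weight_in_bwd = True  # e.g. read from the Float8 config

if force_recompute_fp8_weight_in_bwd:
    # Hypothetical warning text; the real check would live where the config
    # is consumed.
    warnings.warn(
        "force_recompute_fp8_weight_in_bwd is enabled: it is recommended to "
        "rely on torch.compile for activation checkpointing. If you apply a "
        "customized AC wrapper on top, make sure to handle checkpointing of "
        "the fp8 weights yourself."
    )
```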
Reviewed By: vkuzo
Differential Revision: D63345959
Pull Request resolved: pytorch#936