Use checkpoint to enforce the recomputation of fp8 weight (pytorch#936)

Summary: Pull Request resolved: pytorch#936

The issue: When using float8 training with FSDP, we have these tensors in the forward_backward graph:

- Without fp8-all-gather: original_weight (all-gather output, sharded) - fp8_weight - fp8_weight_transpose (needed in backward)
- With fp8-all-gather: original_weight (sharded) - fp8_weight (all-gather output, sharded) - fp8_weight_transpose (needed in backward)

`torch.compile` decides how to partition the graph and which tensors to save for backward. Both with and without fp8-all-gather, it decides to save `fp8_weight_transpose` for backward. That is fine in the single-GPU case, where computing both fp8_weight and fp8_weight_transpose in forward can be fused into one kernel. However, if we use FSDP to shard the weights, the weight itself is sharded but the `fp8_weight_transpose` tensors are not, so saving them for backward costs a lot of memory.

----

To fix it, we have different options:

- In the user code, enforce which tensors to save for backward:
  - The `save_for_backward` in a custom autograd.Function is one way to specify which tensors to save. However, torch.compile ignores what is manually saved for backward in a custom autograd.Function and just runs the partitioner.
  - **[This PR]** Use `torch.utils.checkpoint`, which is the API that compile does promise to respect today. It instructs compile to save only its inputs for backward (the weight and activation), and not the intermediate values from the float8 cast.
- Rely on torch.compile to find the best partition that optimizes both computation and memory. This may be a much longer-term fix in compile.

Differential Revision: D63345959
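
The recomputation behavior described above can be illustrated in eager mode with a minimal sketch. It is not the torchao code: `fake_cast` is a hypothetical stand-in for the float8 cast, the weight is passed as an explicit argument rather than read from `self`, and `use_reentrant=False` is an assumption made here (the commit calls `checkpoint.checkpoint` without that argument). The call counter shows that the cast runs once in forward and again when backward recomputes the checkpointed region.

```python
import torch
from torch.utils.checkpoint import checkpoint

calls = {"cast": 0}


def fake_cast(weight: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the float8 weight cast; counts invocations.
    calls["cast"] += 1
    return weight * 0.5


def cast_weight_and_matmul(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Mirrors the shape of the method added in this commit: cast, then matmul.
    w_cast = fake_cast(weight)
    return x @ w_cast.t()


torch.manual_seed(0)
x = torch.randn(4, 8, requires_grad=True)
weight = torch.randn(16, 8, requires_grad=True)

# Reference run without checkpointing.
cast_weight_and_matmul(x, weight).sum().backward()
grad_ref = weight.grad.clone()
x.grad = None
weight.grad = None
calls["cast"] = 0

# Checkpointed run: intermediates such as w_cast are not kept for backward.
out = checkpoint(cast_weight_and_matmul, x, weight, use_reentrant=False)
calls_after_forward = calls["cast"]
out.sum().backward()
calls_after_backward = calls["cast"]

print(calls_after_forward, calls_after_backward)
print(torch.allclose(weight.grad, grad_ref))
```

The gradients match the non-checkpointed run; the cost is one extra cast in backward, traded against not holding the cast weight between forward and backward.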
y-sq · Sep 24, 2024 · 4a60a29 · 4a60a29
1 parent 728d629
commit 4a60a29

Showing 2 changed files with 19 additions and 2 deletions.
diff --git a/torchao/float8/config.py b/torchao/float8/config.py
@@ -132,6 +132,12 @@ class Float8LinearConfig:
     # configuration, this field may move to per-tensor configs.
     delayed_scaling_config: DelayedScalingConfig = DelayedScalingConfig()
 
+    # If True, fp8_weight will always be re-computed in backward. 
+    # If False, fp8_weight from forward may be saved for backward.
+    # It's recommended to enable this flag when using FSDP. 
+    # Otherwise, the entire fp8_weight, instead of the sharded weight may be saved.
+    recompute_fp8_weight_in_bwd: bool = False
+
 
 # If True, use 'fnuz' float8 types for calculations.
 # Currently, ROCm only supports fnuz variants.

diff --git a/torchao/float8/float8_linear.py b/torchao/float8/float8_linear.py
@@ -41,6 +41,8 @@
     WeightWithStaticFloat8CastTensor,
 )
 
+import torch.utils.checkpoint as checkpoint
+
 
 # this code was resurrected from https://github.com/pytorch-labs/torchao.float8/pull/128/files
 @torch._dynamo.allow_in_graph
@@ -180,6 +182,8 @@ def __init__(self, *args, **kwargs):
         # would be initialized in every iteration.
         self.enable_pre_and_post_forward = self.config.enable_pre_and_post_forward
 
+        self.recompute_fp8_weight_in_bwd = self.config.recompute_fp8_weight_in_bwd
+
     def create_buffers(self):
         # Default values for history buffers, see above TODO
         history_len = self.config.delayed_scaling_config.history_len
@@ -390,15 +394,22 @@ def float8_post_forward(self):
         # amaxes and scales
         self.is_amax_initialized = True
         self.amax_and_scale_synced = False
+
+    def cast_weight_and_matmul(self, input_fp8):
+        weight_fp8 = self.cast_weight_to_float8(self.weight, self.is_amax_initialized)
+        output = manual_float8_matmul.apply(input_fp8, weight_fp8.t())
+        return output
 
     def forward(self, input: torch.Tensor) -> torch.Tensor:
         if self.has_any_delayed_scaling:
             self.float8_pre_forward(input)
 
         input_fp8 = self.cast_input_to_float8(input, self.is_amax_initialized)
-        weight_fp8 = self.cast_weight_to_float8(self.weight, self.is_amax_initialized)
 
-        output = manual_float8_matmul.apply(input_fp8, weight_fp8.t())
+        if self.recompute_fp8_weight_in_bwd:
+            output = checkpoint.checkpoint(self.cast_weight_and_matmul, input_fp8)
+        else:
+            output = self.cast_weight_and_matmul(input_fp8)
 
         # Cast grad_output to float8_e5m2 during backward
         output = self.cast_output_to_float8_in_bw(output)
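
A hedged usage sketch for the new flag follows. It assumes a torchao build that includes this commit, and that `Float8LinearConfig` and `convert_to_float8_training` are importable from `torchao.float8` and that the latter accepts a `config=` keyword. The toy model is hypothetical; only the field name `recompute_fp8_weight_in_bwd` comes from the diff above.

```python
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Opt in to recomputing fp8_weight in backward (the default in this commit is False).
config = Float8LinearConfig(recompute_fp8_weight_in_bwd=True)

# Hypothetical toy model; in practice this would be the model later wrapped with FSDP.
model = nn.Sequential(nn.Linear(64, 64, bias=False))

# Swaps nn.Linear children for float8 linear modules that read the flag from the config.
convert_to_float8_training(model, config=config)
```

Per the config comment in the diff, enabling the flag is recommended with FSDP so that the unsharded fp8 weight is not saved for backward.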