
Commit 4a851f3

deepcharm, loadams, and hwchen2017 committed
Avoid graph break by removing redundant requires_grad attr change (#7158)
This PR is a continuation of the efforts to improve DeepSpeed performance when using PyTorch compile.

Dynamo breaks the graph on `flat_tensor.requires_grad = False` because the assignment:

* Is a side-effecting operation on tensor metadata
* Occurs in a context where Dynamo expects static tensor properties for tracing

The assignment is redundant and can be safely removed because:

* The `_allgather_params()` function is already decorated with `@torch.no_grad()`, which ensures the desired property
* `flat_tensor` is created with `torch.empty()`, which sets `requires_grad=False` by default

---------

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
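A minimal sketch of the reasoning above (the function name and sizes are hypothetical, not DeepSpeed's actual `_allgather_params`): inside a `@torch.no_grad()` function, a tensor created with `torch.empty()` already has `requires_grad=False`, so an explicit assignment would be a no-op.

    import torch

    @torch.no_grad()
    def allgather_sketch(partition_size: int, num_partitions: int) -> torch.Tensor:
        tensor_size = partition_size * num_partitions
        # torch.empty() creates tensors with requires_grad=False by default,
        # and @torch.no_grad() disables gradient tracking in any case, so
        # `flat_tensor.requires_grad = False` would add nothing here.
        flat_tensor = torch.empty(tensor_size, dtype=torch.float32)
        assert flat_tensor.requires_grad is False
        return flat_tensor

    print(allgather_sketch(4, 2).requires_grad)  # False, with no explicit attribute change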
1 parent 78ec025 commit 4a851f3

File tree

1 file changed: 0 additions, 1 deletion

deepspeed/runtime/zero/partition_parameters.py

Lines changed: 0 additions & 1 deletion
@@ -1899,7 +1899,6 @@ def _allgather_params(self, param_list, hierarchy=0):
 
         tensor_size = partition_size * self.num_partitions
         flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype, device=self.local_device)
-        flat_tensor.requires_grad = False
         partitions = []
         for i in range(self.num_partitions):
             start = partition_size * i
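One hedged way to check that this pattern no longer forces a graph break (a hypothetical test, not part of this PR, assuming a recent PyTorch 2.x) is to compile a small function following the same pattern with `fullgraph=True`, which raises an error instead of silently breaking the graph:

    import torch

    @torch.no_grad()
    def build_flat_tensor(partition_size: int, num_partitions: int) -> torch.Tensor:
        # Mirrors the pattern in _allgather_params after this change: no
        # requires_grad assignment, so no tensor-metadata side effect to trace.
        return torch.empty(partition_size * num_partitions, dtype=torch.float32)

    # fullgraph=True asks Dynamo to capture a single graph and to error out
    # rather than fall back when it would otherwise break the graph.
    compiled = torch.compile(build_flat_tensor, fullgraph=True)
    print(compiled(4, 2).shape)  # torch.Size([8])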

0 commit comments
