Neuron: When save_safetensor=False, no need to move model to CPU
save_safetensors=True has been the default since release 4.35.0, which then
required the TPU hotfix huggingface#27799 (issue huggingface#27578).
However, when save_safetensors is set to False (compatibility mode),
moving the model to CPU during checkpointing causes generation of too many
graphs (huggingface#28438).
This PR disables moving the model to CPU when save_safetensors=False.
jeffhataws committed Mar 17, 2024
1 parent 00c1d87 commit 8a5ab0b
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions src/transformers/trainer.py
@@ -3013,7 +3013,8 @@ def _save_tpu(self, output_dir: Optional[str] = None):
         logger.info(f"Saving model checkpoint to {output_dir}")
         model = self.model
         xm.mark_step()
-        model.to("cpu")
+        if self.args.save_safetensors:
+            model.to("cpu")
 
         if xm.is_master_ordinal():
             os.makedirs(output_dir, exist_ok=True)
@@ -3048,7 +3049,8 @@ def _save_tpu(self, output_dir: Optional[str] = None):
 
         # We moved the model from TPU -> CPU for saving the weights.
         # Now we should move it back to subsequent compute still works.
-        model.to(self.args.device)
+        if self.args.save_safetensors:
+            model.to(self.args.device)
 
     def _save(self, output_dir: Optional[str] = None, state_dict=None):
         # If we are executing this function, we are the process zero, so we don't check for that.
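The effect of the patch can be sketched in isolation. Below, `FakeModel` and `Args` are hypothetical stand-ins for the Trainer's model and its `TrainingArguments` (they only record device moves), and `save_tpu_device_moves` mirrors the guarded before/after moves that the patched `_save_tpu` performs:

```python
# Toy stand-ins (not the real transformers classes) that record
# every device the model is "moved" to.

class FakeModel:
    def __init__(self):
        self.device = "xla"
        self.moves = []

    def to(self, device):
        self.device = device
        self.moves.append(device)


class Args:
    def __init__(self, save_safetensors, device="xla"):
        self.save_safetensors = save_safetensors
        self.device = device


def save_tpu_device_moves(model, args):
    # Before saving: only move to CPU when safetensors saving needs it.
    if args.save_safetensors:
        model.to("cpu")
    # ... checkpoint weights would be written here ...
    # After saving: move back only if we moved away in the first place.
    if args.save_safetensors:
        model.to(args.device)
    return model.moves


print(save_tpu_device_moves(FakeModel(), Args(save_safetensors=True)))   # ['cpu', 'xla']
print(save_tpu_device_moves(FakeModel(), Args(save_safetensors=False)))  # []
```

With `save_safetensors=False` the model never leaves the device, so no extra XLA graphs are compiled for the round trip, which is the behavior the commit message describes.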
