Neuron: When save_safetensor=False, no need to move model to CPU
save_safetensors=True has been the default since release 4.35.0, which then
required the TPU hotfix huggingface#27799 (issue huggingface#27578).
However, when save_safetensors is set to False (compatibility mode),
moving the model to CPU during checkpointing causes generation of too many
graphs (huggingface#28438).
This PR disables moving the model to CPU when save_safetensors=False.
jeffhataws committed Mar 17, 2024
1 parent 00c1d87 commit 8a5ab0b
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions src/transformers/trainer.py
@@ -3013,7 +3013,8 @@ def _save_tpu(self, output_dir: Optional[str] = None):
         logger.info(f"Saving model checkpoint to {output_dir}")
         model = self.model
         xm.mark_step()
-        model.to("cpu")
+        if self.args.save_safetensors:
+            model.to("cpu")
 
         if xm.is_master_ordinal():
             os.makedirs(output_dir, exist_ok=True)
@@ -3048,7 +3049,8 @@ def _save_tpu(self, output_dir: Optional[str] = None):
 
         # We moved the model from TPU -> CPU for saving the weights.
         # Now we should move it back to subsequent compute still works.
-        model.to(self.args.device)
+        if self.args.save_safetensors:
+            model.to(self.args.device)
 
     def _save(self, output_dir: Optional[str] = None, state_dict=None):
         # If we are executing this function, we are the process zero, so we don't check for that.
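The effect of the patch can be sketched in isolation. Below, `FakeModel` and `Args` are hypothetical stand-ins for the Trainer's model and its `TrainingArguments` (they only record device moves), and `save_tpu_device_moves` mirrors the guarded before/after moves that the patched `_save_tpu` performs:

```python
# Toy stand-ins (not the real transformers classes) that record
# every device the model is "moved" to.

class FakeModel:
    def __init__(self):
        self.device = "xla"
        self.moves = []

    def to(self, device):
        self.device = device
        self.moves.append(device)


class Args:
    def __init__(self, save_safetensors, device="xla"):
        self.save_safetensors = save_safetensors
        self.device = device


def save_tpu_device_moves(model, args):
    # Before saving: only move to CPU when safetensors saving needs it.
    if args.save_safetensors:
        model.to("cpu")
    # ... checkpoint weights would be written here ...
    # After saving: move back only if we moved away in the first place.
    if args.save_safetensors:
        model.to(args.device)
    return model.moves


print(save_tpu_device_moves(FakeModel(), Args(save_safetensors=True)))   # ['cpu', 'xla']
print(save_tpu_device_moves(FakeModel(), Args(save_safetensors=False)))  # []
```

With `save_safetensors=False` the model never leaves the device, so no extra XLA graphs are compiled for the round trip, which is the behavior the commit message describes.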
