[Hot-Fix][XLA] Re-enable broken _tpu_save for XLATensors #27799
Conversation
Force-pushed from 0e18b25 to 9645ad5.
I see this in the CI test run:
Also:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
No worries, the test is skipped on main, so feel free to rebase!
Nice catch, thanks for the hot fix.
There was a recent issue with this, #27578, which I think should be fixed by this.
Thank you, cc @muellerzr @pacman100 for visibility
I am having issues with checkpointing when training on TPU and I suspect this may be the cause.
@@ -2831,18 +2831,20 @@ def _save_tpu(self, output_dir: Optional[str] = None):
     xm.rendezvous("saving_checkpoint")
     if not isinstance(self.model, PreTrainedModel):
         if isinstance(unwrap_model(self.model), PreTrainedModel):
-            unwrap_model(self.model).save_pretrained(
+            unwrap_model(self.model).to("cpu").save_pretrained(
I believe this change leaves the model on CPU, since `to` is an in-place operation for modules. It is necessary to move the model back to the TPU/XLA device afterwards.
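A minimal sketch of the pitfall, assuming a `MockModule` class standing in for `torch.nn.Module`, whose `to()` mutates the module in place and returns `self` (unlike tensors, where `to()` returns a new object):

```python
class MockModule:
    """Stand-in for nn.Module; `to` is in-place and returns self."""

    def __init__(self):
        self.device = "xla"

    def to(self, device):
        self.device = device  # in-place move, same contract as nn.Module.to
        return self

    def save_pretrained(self, output_dir):
        # saving requires valid (CPU) storage
        assert self.device == "cpu"


model = MockModule()
model.to("cpu").save_pretrained("out")
# `to` mutated the module: it is still on CPU after saving, so training
# would silently continue off the XLA device unless we move it back.
assert model.device == "cpu"
model.to("xla")  # explicit move back
assert model.device == "xla"
```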
For example, in the context of accelerate, I had to run `prepare` after the `to('cpu')` call:
if not isinstance(self.model, PreTrainedModel):
    if isinstance(unwrap_model(self.model), PreTrainedModel):
        unwrap_model(self.model).to("cpu").save_pretrained(
            output_dir,
            is_main_process=self.args.should_save,
            state_dict=self.model.state_dict(),
            save_function=xm.save,
        )
        self.model = self.accelerator.prepare(self.model)
    else:
        logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.")
        # `state_dict()` returns an OrderedDict, which has no `.to()`;
        # each tensor has to be moved to CPU individually.
        state_dict = {k: v.to("cpu") for k, v in self.model.state_dict().items()}
        xm.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
else:
    self.model.to("cpu").save_pretrained(
        output_dir, is_main_process=self.args.should_save, save_function=xm.save
    )
    self.model = self.accelerator.prepare(self.model)
Though this may need to be handled separately for different backends 😄.
Thanks, I'll take a look.
save_safetensor=True is the default as of release 4.35.0, which then required TPU hotfix huggingface#27799 (issue huggingface#27578). However, when save_safetensor is set to False (compatibility mode), moving the model to CPU causes generation of too many graphs during checkpointing (huggingface#28438). This PR disables moving the model to CPU when save_safetensor=False.
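A sketch of the conditional CPU move described above. Assumptions: the `safe_serialization` flag stands in for save_safetensor, `save_fn` for `xm.save`, and `MockModel` for the Trainer's model. Safetensors needs valid CPU storage pointers, so the device round-trip is only paid on that path; the compatibility path saves directly and avoids regenerating XLA graphs on every checkpoint:

```python
def save_checkpoint(model, safe_serialization, save_fn):
    if safe_serialization:
        model.to("cpu")   # safetensors requires real (CPU) storage
        save_fn(model)
        model.to("xla")   # move back so training resumes on device
    else:
        save_fn(model)    # compatibility mode: no device round-trip


class MockModel:
    """Stand-in model that counts device moves (each move retraces graphs)."""

    def __init__(self):
        self.device = "xla"
        self.moves = 0

    def to(self, device):
        self.device = device
        self.moves += 1
        return self


m = MockModel()
save_checkpoint(m, safe_serialization=False, save_fn=lambda m: None)
assert m.moves == 0  # no graph-generating device moves in legacy mode
save_checkpoint(m, safe_serialization=True, save_fn=lambda m: None)
assert m.device == "xla" and m.moves == 2
```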
@yeounoh @ArthurZucker could you please check whether #29703 would work for TPU?
What does this PR do?
This re-enables checkpointing the model on the `xla` device, since that wasn't allowed in some cases where XLATensors have an invalid `storage.data_ptr()`. This change explicitly moves the model to `cpu` before checkpointing to ensure the storage is valid. This is a hot-fix to unblock all failing HF model checkpointing on XLA devices.
This fixes the existing `_save_tpu` implementation in `trainer.py`, which is currently broken:
With this change, it works:
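The "move to CPU so storage is valid" idea also applies at the state-dict level. A sketch under stated assumptions: `MockTensor` stands in for a torch tensor (where `.to()` returns a moved copy), and a plain dict stands in for the model's state dict; `cpu_state_dict` is a hypothetical helper, not Trainer API:

```python
def cpu_state_dict(state_dict):
    """Return a copy of the state dict with every tensor moved to CPU,
    so each tensor's storage.data_ptr() is valid for saving."""
    return {name: tensor.to("cpu") for name, tensor in state_dict.items()}


class MockTensor:
    """Stand-in for a tensor: `.to()` returns a moved copy, not self."""

    def __init__(self, device="xla"):
        self.device = device

    def to(self, device):
        return MockTensor(device)


sd = {"weight": MockTensor(), "bias": MockTensor()}
moved = cpu_state_dict(sd)
assert all(t.device == "cpu" for t in moved.values())
assert all(t.device == "xla" for t in sd.values())  # originals untouched
```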