
[Hot-Fix][XLA] Re-enable broken _tpu_save for XLATensors #27799

Merged (2 commits) on Dec 4, 2023

Conversation

@yeounoh (Contributor) commented on Dec 1, 2023

What does this PR do?

This re-enables checkpointing the model on an XLA device, which previously failed in cases where XLATensors have an invalid storage.data_ptr(). The fix explicitly moves the model to CPU before checkpointing to ensure the storage is valid.

This is a hot-fix to unblock all failing HF model checkpointing on XLA devices.
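The core of the change, in spirit (a minimal sketch, not the actual trainer.py patch; the real method also saves the tokenizer and training arguments):

    # Minimal sketch of the fix, assuming torch_xla is installed and `model`
    # is a transformers PreTrainedModel living on an XLA device.
    import torch_xla.core.xla_model as xm

    def save_on_xla(model, output_dir):
        xm.rendezvous("saving_checkpoint")  # sync all processes before writing
        # XLATensors can report an invalid storage.data_ptr(), which breaks
        # safetensors serialization, so move the weights to CPU first.
        model.to("cpu").save_pretrained(output_dir, save_function=xm.save)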

This fixes the existing _save_tpu implementation in trainer.py, which is currently broken:

100%|███████████████████████████████████████████| 10/10 [00:44<00:00,  4.49s/it]
[INFO|trainer.py:2821] 2023-12-01 23:04:53,098 >> Saving model checkpoint to /workspace/MNLI
[INFO|configuration_utils.py:458] 2023-12-01 23:05:09,794 >> Configuration saved in /workspace/MNLI/config.json
/usr/local/lib/python3.8/site-packages/safetensors-0.4.1rc1-py3.8-linux-x86_64.egg/safetensors/torch.py:17: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return tensor.storage().data_ptr()
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/safetensors-0.4.1rc1-py3.8-linux-x86_64.egg/safetensors/torch.py", line 13, in storage_ptr
    return tensor.untyped_storage().data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.
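The failure comes from safetensors asking each tensor for its storage pointer, which an XLATensor cannot provide. A hypothetical minimal repro (requires torch_xla and an XLA device):

    import torch
    import torch_xla.core.xla_model as xm

    t = torch.ones(2, 2, device=xm.xla_device())
    # This is the call safetensors' storage_ptr() makes; for an XLATensor the
    # underlying python storage is invalid, so it raises the RuntimeError above.
    t.untyped_storage().data_ptr()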

With this change, it works:

100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  2.28it/s][INFO|trainer.py:1975] 2023-12-01 23:44:53,563 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 7.1392, 'train_samples_per_second': 1.121, 'train_steps_per_second': 0.14, 'train_loss': 1.0944234132766724, 'epoch': 1.0}
100%|█████████████████████████████████████████████| 1/1 [00:07<00:00,  7.14s/it]
[INFO|trainer.py:2821] 2023-12-01 23:45:00,259 >> Saving model checkpoint to /workspace/MNLI
[INFO|configuration_utils.py:458] 2023-12-01 23:45:18,980 >> Configuration saved in /workspace/MNLI/config.json
[INFO|modeling_utils.py:1851] 2023-12-01 23:45:20,612 >> Model weights saved in /workspace/MNLI/pytorch_model.bin
[INFO|tokenization_utils_base.py:2210] 2023-12-01 23:45:20,613 >> tokenizer config file saved in /workspace/MNLI/tokenizer_config.json
[INFO|tokenization_utils_base.py:2217] 2023-12-01 23:45:20,613 >> Special tokens file saved in /workspace/MNLI/special_tokens_map.json

@yeounoh changed the title from "[XLA] Re-enable broken _tpu_save for XLATensors" to "[Hot-Fix][XLA] Re-enable broken _tpu_save for XLATensors" on Dec 1, 2023
@yeounoh (Contributor, Author) commented on Dec 2, 2023

I see this in the CI test run, tests_torch:

 if cls.token is None:
>           raise ValueError("Cannot run tests as secret isn't setup.")
E           ValueError: Cannot run tests as secret isn't setup.

@yeounoh (Contributor, Author) commented on Dec 2, 2023

Also, run_tests is blocked, with "This workflow is awaiting approval from a maintainer in #27799"

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ArthurZucker (Collaborator) commented

No worries, the test is skipped on main, feel free to rebase!

@ArthurZucker (Collaborator) left a comment

Nice catch, thanks for the hot fix.

@ArthurZucker (Collaborator) commented

There was a related issue recently (#27578) that should be fixed by this, I think.

@LysandreJik (Member) left a comment

Thank you, cc @muellerzr @pacman100 for visibility

@Asap7772 left a comment

I am having issues with checkpointing when training on TPU and I suspect this may be the cause.

@@ -2831,18 +2831,20 @@ def _save_tpu(self, output_dir: Optional[str] = None):
         xm.rendezvous("saving_checkpoint")
         if not isinstance(self.model, PreTrainedModel):
             if isinstance(unwrap_model(self.model), PreTrainedModel):
-                unwrap_model(self.model).save_pretrained(
+                unwrap_model(self.model).to("cpu").save_pretrained(


I believe this change causes the model to stay on CPU, since nn.Module.to is an in-place operation. It is necessary to move the model back to the TPU/XLA device.
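This is easy to verify in isolation (a standalone sketch, unrelated to this PR's code):

    import torch

    model = torch.nn.Linear(2, 2)
    moved = model.to("cpu")
    # nn.Module.to mutates the module in place and returns the same object,
    # unlike Tensor.to, which returns a new tensor and leaves the original alone.
    assert moved is model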

@Asap7772 commented on Dec 19, 2023

For example, in the context of accelerate, I had to run prepare after the to('cpu'):

        if not isinstance(self.model, PreTrainedModel):
            if isinstance(unwrap_model(self.model), PreTrainedModel):
                unwrap_model(self.model).to("cpu").save_pretrained(
                    output_dir,
                    is_main_process=self.args.should_save,
                    state_dict=self.model.state_dict(),
                    save_function=xm.save,
                )
                # to("cpu") is in-place on nn.Module, so move the model back
                # onto the XLA device by re-running accelerate's prepare().
                self.model = self.accelerator.prepare(self.model)
            else:
                logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.")
                # state_dict() returns a dict, which has no .to(); move each
                # tensor to CPU individually before handing it to xm.save.
                state_dict = {k: v.to("cpu") for k, v in self.model.state_dict().items()}
                xm.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
        else:
            self.model.to("cpu").save_pretrained(
                output_dir, is_main_process=self.args.should_save, save_function=xm.save
            )
            self.model = self.accelerator.prepare(self.model)

@Asap7772 commented on Dec 19, 2023

Though this may need to be handled separately for different backends 😄.

A collaborator left a comment

Hi @Asap7772, a recent PR was merged to address this: #27993

Could you try installing from source to see if it resolves the issue for you? If not, could you raise a new GitHub issue?


Thanks, I'll take a look.

jeffhataws added a commit to jeffhataws/transformers that referenced this pull request Mar 17, 2024
save_safetensor=True is the default as of release 4.35.0, which then
required TPU hotfix huggingface#27799 (issue huggingface#27578).
However, when the flag save_safetensor is set to False (compatibility mode),
moving the model to CPU causes the generation of too many graphs
during checkpointing (huggingface#28438).
This PR disables moving the model to CPU when save_safetensor=False.
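Roughly, the intended behavior is the following (a hedged sketch; the save_model helper and its save_safetensors argument are illustrative, not the actual patch):

    import torch_xla.core.xla_model as xm

    def save_model(model, output_dir, save_safetensors=True):
        # Only detour through CPU when safetensors needs valid host storage;
        # with save_safetensors=False the model can stay on the XLA device,
        # avoiding the extra graph compilations at every checkpoint.
        if save_safetensors:
            model = model.to("cpu")
        model.save_pretrained(output_dir, save_function=xm.save)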
@jeffhataws (Contributor) commented

@yeounoh @ArthurZucker please help check if #29703 would work for TPU?

amyeroberts pushed a commit that referenced this pull request Apr 24, 2024
itazap pushed a commit that referenced this pull request May 14, 2024