[Hot-Fix][XLA] Re-enable broken _tpu_save for XLATensors #27799
Conversation
Force-pushed from 0e18b25 to 9645ad5.
I see this in the CI test run:
Also:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
No worries, the test is skipped on main, so feel free to rebase!
Nice catch, thanks for the hot fix.
There was a recent issue with this, #27578, which I think should be fixed by this.
Thank you, cc @muellerzr @pacman100 for visibility
I am having issues with checkpointing when training on TPU and I suspect this may be the cause.
@@ -2831,18 +2831,20 @@ def _save_tpu(self, output_dir: Optional[str] = None):
     xm.rendezvous("saving_checkpoint")
     if not isinstance(self.model, PreTrainedModel):
         if isinstance(unwrap_model(self.model), PreTrainedModel):
-            unwrap_model(self.model).save_pretrained(
+            unwrap_model(self.model).to("cpu").save_pretrained(
I believe this change leaves the model on CPU, since `to` is an in-place operation for modules. It is necessary to move the model back to the TPU/XLA device afterwards.
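A minimal sketch of the pitfall, assuming a `MockModule` class standing in for `torch.nn.Module`, whose `to()` mutates the module in place and returns `self` (unlike tensors, where `to()` returns a new object):

```python
class MockModule:
    """Stand-in for nn.Module; `to` is in-place and returns self."""

    def __init__(self):
        self.device = "xla"

    def to(self, device):
        self.device = device  # in-place move, same contract as nn.Module.to
        return self

    def save_pretrained(self, output_dir):
        # saving requires valid (CPU) storage
        assert self.device == "cpu"


model = MockModule()
model.to("cpu").save_pretrained("out")
# `to` mutated the module: it is still on CPU after saving, so training
# would silently continue off the XLA device unless we move it back.
assert model.device == "cpu"
model.to("xla")  # explicit move back
assert model.device == "xla"
```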
For example, in the context of accelerate, I had to run `prepare` after the `to('cpu')` call:
if not isinstance(self.model, PreTrainedModel):
    if isinstance(unwrap_model(self.model), PreTrainedModel):
        unwrap_model(self.model).to("cpu").save_pretrained(
            output_dir,
            is_main_process=self.args.should_save,
            state_dict=self.model.state_dict(),
            save_function=xm.save,
        )
        self.model = self.accelerator.prepare(self.model)
    else:
        logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.")
        # `state_dict()` returns an OrderedDict, which has no `.to()`;
        # each tensor has to be moved to CPU individually.
        state_dict = {k: v.to("cpu") for k, v in self.model.state_dict().items()}
        xm.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
else:
    self.model.to("cpu").save_pretrained(
        output_dir, is_main_process=self.args.should_save, save_function=xm.save
    )
    self.model = self.accelerator.prepare(self.model)
Though this may need to be handled separately for different backends 😄.
Thanks, I'll take a look.
save_safetensor=True is the default as of release 4.35.0, which then required TPU hotfix huggingface#27799 (issue huggingface#27578). However, when save_safetensor is set to False (compatibility mode), moving the model to CPU causes generation of too many graphs during checkpointing (huggingface#28438). This PR disables moving the model to CPU when save_safetensor=False.
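A sketch of the conditional CPU move described above. Assumptions: the `safe_serialization` flag stands in for save_safetensor, `save_fn` for `xm.save`, and `MockModel` for the Trainer's model. Safetensors needs valid CPU storage pointers, so the device round-trip is only paid on that path; the compatibility path saves directly and avoids regenerating XLA graphs on every checkpoint:

```python
def save_checkpoint(model, safe_serialization, save_fn):
    if safe_serialization:
        model.to("cpu")   # safetensors requires real (CPU) storage
        save_fn(model)
        model.to("xla")   # move back so training resumes on device
    else:
        save_fn(model)    # compatibility mode: no device round-trip


class MockModel:
    """Stand-in model that counts device moves (each move retraces graphs)."""

    def __init__(self):
        self.device = "xla"
        self.moves = 0

    def to(self, device):
        self.device = device
        self.moves += 1
        return self


m = MockModel()
save_checkpoint(m, safe_serialization=False, save_fn=lambda m: None)
assert m.moves == 0  # no graph-generating device moves in legacy mode
save_checkpoint(m, safe_serialization=True, save_fn=lambda m: None)
assert m.device == "xla" and m.moves == 2
```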
@yeounoh @ArthurZucker could you please check whether #29703 would work for TPU?
What does this PR do?
This re-enables checkpointing the model on the `xla` device, since that wasn't allowed in some cases where XLATensors have an invalid `storage.data_ptr()`. This change explicitly moves the model to `cpu` before checkpointing to ensure the storage is valid. This is a hot-fix to unblock all failing HF model checkpointing on XLA devices.
This fixes the existing `_save_tpu` implementation in `trainer.py`, which is currently broken:
With this change, it works:
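The "move to CPU so storage is valid" idea also applies at the state-dict level. A sketch under stated assumptions: `MockTensor` stands in for a torch tensor (where `.to()` returns a moved copy), and a plain dict stands in for the model's state dict; `cpu_state_dict` is a hypothetical helper, not Trainer API:

```python
def cpu_state_dict(state_dict):
    """Return a copy of the state dict with every tensor moved to CPU,
    so each tensor's storage.data_ptr() is valid for saving."""
    return {name: tensor.to("cpu") for name, tensor in state_dict.items()}


class MockTensor:
    """Stand-in for a tensor: `.to()` returns a moved copy, not self."""

    def __init__(self, device="xla"):
        self.device = device

    def to(self, device):
        return MockTensor(device)


sd = {"weight": MockTensor(), "bias": MockTensor()}
moved = cpu_state_dict(sd)
assert all(t.device == "cpu" for t in moved.values())
assert all(t.device == "xla" for t in sd.values())  # originals untouched
```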