"Attempted to access the data pointer on an invalid python storage" when saving model in TPU mode (Kaggle) #27578
Comments
I know you provided a link to the notebook, but a minimal reproducer would still be welcomed here! 🤗 cc @LysandreJik this might be related to the latest changes? Do you want to have a look?
Would be eager to hear your thoughts on it @Narsil
Hello!
Seems there's a bug in torch itself there, since safetensors is only using public API.
The CPU fix #27799 will work, but only by moving everything to CPU, which isn't desirable imo. Do we have XLA/torch experts who could shed some light on how to detect an XLA tensor specifically? (I would implement the same to-CPU move in safetensors if the tensor is on an XLA device.) Although this could easily be brought up as a bug to pytorch too, no? @LysandreJik Minimal repro: https://colab.research.google.com/drive/1O9EqLD-Vfp7PGGldNeJtRtq3oUpLnOJV?usp=sharing (Choose TPU runtime)
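For reference, a minimal sketch of the kind of detection discussed above — this is not the actual safetensors or transformers implementation — assuming torch_xla is installed and some tensors may live on an XLA (TPU) device:

```python
# Sketch only, not the safetensors/transformers implementation. Assumes
# torch_xla is installed and some tensors may live on an XLA (TPU) device.
import torch
from safetensors.torch import save_file


def to_cpu_if_xla(tensor: torch.Tensor) -> torch.Tensor:
    # XLA tensors report device type "xla" and have no real data pointer
    # behind their storage, which is what triggers the error when saving
    # them directly with safetensors.
    return tensor.cpu() if tensor.device.type == "xla" else tensor


def save_state_dict_xla_safe(state_dict: dict, path: str) -> None:
    # Move any XLA tensors to CPU first, then serialize with safetensors.
    cpu_state = {name: to_cpu_if_xla(t).contiguous() for name, t in state_dict.items()}
    save_file(cpu_state, path)
```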
Can check.
Also, could someone help land #27993? It could resolve this issue.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
As the PR was merged, let's close this.
save_safetensors=True is the default as of release 4.35.0, which then required TPU hotfix #27799 (issue #27578). However, when the flag save_safetensors is set to False (compatibility mode), moving the model to CPU causes generation of too many graphs during checkpointing (#28438). This PR disables moving the model to CPU when save_safetensors=False.
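A short sketch of the two checkpointing paths described above. It assumes transformers >= 4.35, where safetensors serialization is the default for Trainer checkpoints; the behaviour notes in the comments summarize the linked PRs rather than specify them:

```python
# Illustrative only; assumes transformers >= 4.35, where safetensors
# serialization is the default for Trainer checkpoints.
from transformers import TrainingArguments

# Default path: checkpoints are written as safetensors. On TPU this relied on
# the fix in #27799, which moves tensors off the XLA device before saving.
args_default = TrainingArguments(output_dir="out")

# Compatibility mode: fall back to torch.save. With this PR the model is no
# longer moved to CPU in this path, avoiding the extra XLA graphs reported
# in #28438.
args_compat = TrainingArguments(output_dir="out", save_safetensors=False)
```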
System Info
It keeps happening whenever I try to use TPU mode to fine-tune a BERT model for sentiment analysis. Everything works fine in GPU mode. I even tried to downgrade/upgrade TensorFlow & safetensors, but that didn't work either. Can you give me any suggestions?
Link to that notebook: https://www.kaggle.com/code/phttrnnguyngia/final
trainer.save_model('final-result')
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Run in Kaggle TPU mode. Environment: always the latest Kaggle environment. Input data is included in the notebook.
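A hedged, minimal reproduction sketch (condensed rather than copied from the linked notebook; the checkpoint name and output paths are illustrative), assuming a Kaggle TPU runtime with torch_xla available and transformers 4.35+:

```python
# Minimal sketch of the failure path: the Trainer places the model on the
# XLA device, and saving via the default safetensors path raised
# "Attempted to access the data pointer on an invalid python storage"
# before the fix in #27799.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # illustrative; the notebook fine-tunes a BERT variant
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

args = TrainingArguments(output_dir="out")  # save_safetensors defaults to True in 4.35+
trainer = Trainer(model=model, args=args, tokenizer=tokenizer)

# Fails on TPU/XLA before the hotfix; works on GPU.
trainer.save_model("final-result")
```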
Expected behavior
Expected the model to save successfully, as it does when using a GPU.