When save a model on TPU, make a copy to be moved to CPU #27993
Conversation
@qihqi this is done inside
I see, we need to make a copy. Should we do that inside
Thanks for the fix! Quick question: I used to see this error
Thanks! I also started with the assumption that the input was wrong, traced where the input comes from, and arrived here: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2668 — here the input is clearly being moved to the device. Also, when I ran the script, the first few steps trained fine, so I started suspecting something happens after training is done.
Hi @LysandreJik @ArthurZucker, would you be able to help with a review? Thanks!
Pinging our TPU expert @muellerzr
Very nice catch!
Thanks for adding this!
I'm not very familiar with TPUs. My only concern is: if this behaves the same as a model on GPU, then it requires enough space for two models on the device when copy.deepcopy(self.model) is called, i.e. there is no offloading.
Good point! I have changed the code to first move the model to CPU and then move it back. PTAL.
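The trade-off discussed above can be sketched with a toy device-memory counter (all names here, such as FakeDevice and FakeModel, are hypothetical illustrations, not transformers code): deepcopying the model briefly needs room for two models on the device, while moving the model off-device and back never does.

```python
import copy

class FakeDevice:
    """Toy accounting for units of memory resident on an accelerator."""
    def __init__(self):
        self.used = 0

class FakeModel:
    """Stand-in for a model whose .to() moves it between devices in place."""
    def __init__(self, device, size=100):
        self.size = size
        self.device = None
        self.to(device)

    def to(self, device):
        # In-place move: release memory on the old device, claim it on the new one.
        if isinstance(self.device, FakeDevice):
            self.device.used -= self.size
        self.device = device
        if isinstance(device, FakeDevice):
            device.used += self.size
        return self

    def __deepcopy__(self, memo):
        # A deep copy materializes a second model on the same device.
        return FakeModel(self.device, self.size)

tpu = FakeDevice()
model = FakeModel(tpu)

# Approach 1: deepcopy, then move the copy to CPU.
clone = copy.deepcopy(model)      # two models resident on the device for a moment
peak_with_deepcopy = tpu.used
clone.to("cpu")

# Approach 2 (what this PR settled on): move to CPU, save, move back.
model.to("cpu")
peak_with_moveback = tpu.used     # nothing extra on the device while saving
model.to(tpu)                     # original model ends up back on the TPU

assert peak_with_deepcopy == 200
assert peak_with_moveback == 0
assert tpu.used == 100
```

The move-to-CPU-then-back approach trades a second host-device transfer for never holding two model copies in device memory at once.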
Thanks for adding and iterating!
@qihqi Just to confirm: were you able to successfully run the script in the PR description with the most recent changes to the PR?
Yes. (Well, to be exact, it's this script: https://github.com/GoogleCloudPlatform/ml-testing-accelerators/blob/master/tests/pytorch/nightly/hf-glue.libsonnet#L41)
…#27993)
* When saving a model, make a copy to be moved to CPU; don't move the original model
* make deepcopy inside of _save_tpu
* Move to TPU without copy
Hi everyone, I'm having problems with the model.to('cpu') part before unwrapping the model (memory access errors), and if I unwrap it before applying .to('cpu'), only one shard of the model is saved.
What does this PR do?
When we save a model on TPU, we first move it to CPU, because TPU tensors have no storage. However, we should do this with a copy of the model, so that the original model stays on TPU. Otherwise, model.to('cpu') would also modify the model in place, and the following error would then be raised when that model is used in compute:
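The in-place behavior described above can be illustrated with a minimal stand-in for a module's .to() method (TinyModule is a hypothetical class for illustration, not the actual PyTorch implementation): .to() mutates the object and returns the same instance, so calling model.to('cpu') just to save leaves the caller's model on CPU unless it is moved back afterwards.

```python
class TinyModule:
    """Minimal stand-in mimicking the in-place, return-self semantics
    of a module's .to() method."""
    def __init__(self, device):
        self.device = device

    def to(self, device):
        self.device = device  # in-place move
        return self           # returns the same object, not a copy

model = TinyModule("xla")     # pretend the model lives on a TPU (XLA) device
cpu_view = model.to("cpu")    # "moving" it for saving...
assert cpu_view is model      # ...actually mutated the original
assert model.device == "cpu"  # the model is no longer on the TPU

# The fix in this PR: move to CPU for saving, then move back,
# so the caller's model ends up where it started.
model.to("cpu")
# ... saving (e.g. torch.save / save_pretrained) would happen here ...
model.to("xla")
assert model.device == "xla"
```

This is why later training or evaluation steps fail after a naive save: the "saved" model silently became a CPU model.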
Tested by running this command on TPU v4-8:
cc @muellerzr and @pacman100