
When saving a model on TPU, make a copy to be moved to CPU #27993

Merged
3 commits merged into huggingface:main on Dec 19, 2023

Conversation

@qihqi (Contributor) commented Dec 13, 2023

What does this PR do?

When we save a model on TPU, we first move it to CPU because TPU tensors have no
storage. However, we should do that with a copy of the model, so that the original model
stays on TPU; otherwise model.to('cpu') modifies the model in place. The following
error is then raised when that model is used in a computation:

indices should be either on cpu or on the same device as the indexed tensor (XLA). When using XLA, the indexed tensor must be an XLA tensor.
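For illustration, here is a minimal sketch of the difference (assuming a working torch_xla install; the identifiers below are illustrative, not the Trainer's actual code):

```python
import copy

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(4, 2).to(device)

# Buggy pattern: nn.Module.to() moves parameters *in place*, so after this
# call the original model lives on CPU and later XLA indexing ops fail.
#   model.to('cpu')

# Fixed pattern (this PR's first revision): save a deep copy on CPU and
# leave the original model on the TPU.
cpu_model = copy.deepcopy(model).to('cpu')
torch.save(cpu_model.state_dict(), '/tmp/model.bin')
assert str(next(model.parameters()).device).startswith('xla')  # original untouched
```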

Tested by running this command on TPU v4-8:

python3 examples/pytorch/text-classification/run_glue.py \
        --model_name_or_path=distilbert-base-uncased \
        --task_name=MNLI \
        --do_train=true \
        --num_train_epochs=1 \
        --max_seq_length=128 \
        --learning_rate=3e-5 \
        --overwrite_output_dir=true \
        --save_steps=3000 \
        --save_strategy=no --output_dir=/workspace/mnli

cc @muellerzr and @pacman100

@yeounoh (Contributor) commented Dec 13, 2023

@qihqi this is done inside _save_tpu (https://github.com/huggingface/transformers/pull/27799/files), or are we missing something?

@yeounoh (Contributor) commented Dec 13, 2023

> @qihqi this is done inside _save_tpu (https://github.com/huggingface/transformers/pull/27799/files), or are we missing something?

I see, we need to make a copy. Should we do that inside _save_tpu, or do it beforehand here?

@vanbasten23 commented

Thanks for the fix! Quick question: I used to see this error ("indices should be either on cpu or on the same device as the indexed tensor (XLA). When using XLA, the indexed tensor must be an XLA tensor.") and it seemed that some index ops failed, so I tried to fix the op instead of the model (and it didn't work out). How did you find out it was the model's problem?

@qihqi (Contributor, Author) commented Dec 14, 2023

> Thanks for the fix! Quick question: I used to see this error ("indices should be either on cpu or on the same device as the indexed tensor (XLA). When using XLA, the indexed tensor must be an XLA tensor.") and it seemed that some index ops failed, so I tried to fix the op instead of the model (and it didn't work out). How did you find out it was the model's problem?

Thanks! I also started with the assumption that the input was wrong, traced where the input comes from, and arrived here: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2668, where the input is clearly being moved to the device.

Also, when I ran the script, the first few steps trained fine, so I started suspecting that something happened after training was done.

@qihqi (Contributor, Author) commented Dec 14, 2023

Hi @LysandreJik @ArthurZucker would you guys be able to help with a review? Thanks!

@LysandreJik (Member) commented

Pinging our TPU expert @muellerzr

@muellerzr (Contributor) left a comment


Very nice catch!

@amyeroberts amyeroberts requested review from amyeroberts and ArthurZucker and removed request for ArthurZucker December 15, 2023 14:45
@amyeroberts (Collaborator) left a comment


Thanks for adding this!

I'm not very familiar with TPUs. My only concern is that, if this behaves the same as a model on GPU, it requires enough space for two models on the device when copy.deepcopy(self.model) is called, i.e. there is no offloading.

@qihqi (Contributor, Author) commented Dec 15, 2023

> Thanks for adding this!
>
> I'm not very familiar with TPUs. My only concern is that, if this behaves the same as a model on GPU, it requires enough space for two models on the device when copy.deepcopy(self.model) is called, i.e. there is no offloading.

Good point! I have changed the code to first move the model to CPU and then move it back. PTAL.
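For reference, a minimal sketch of this move-then-move-back pattern (the helper name save_tpu_model is hypothetical; the real change lives inside Trainer._save_tpu):

```python
import torch
import torch_xla.core.xla_model as xm

def save_tpu_model(model: torch.nn.Module, path: str) -> None:
    """Sketch: save without holding two copies of the weights on the device."""
    device = next(model.parameters()).device
    model.to('cpu')                    # in-place move frees the device copy
    if xm.is_master_ordinal():         # write once in multi-process setups
        torch.save(model.state_dict(), path)
    model.to(device)                   # restore the original placement
```

This trades a small amount of extra host/device transfer time for not needing device memory for two full models at once.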

@muellerzr muellerzr requested review from amyeroberts and removed request for ArthurZucker December 15, 2023 20:18
@amyeroberts (Collaborator) left a comment


Thanks for adding and iterating!

@amyeroberts (Collaborator) commented

@qihqi Just to confirm that you were able to successfully run the script in the PR description with the most recent changes to the PR:

python3 examples/pytorch/text-classification/run_glue.py \
        --model_name_or_path=distilbert-base-uncased \
        --task_name=MNLI \
        --do_train=true \
        --num_train_epochs=1 \
        --max_seq_length=128 \
        --learning_rate=3e-5 \
        --overwrite_output_dir=true \
        --save_steps=3000 \
        --save_strategy=no --output_dir=/workspace/mnli

@qihqi (Contributor, Author) commented Dec 18, 2023

> @qihqi Just to confirm that you were able to successfully run the script in the PR description with the most recent changes to the PR:
>
> python3 examples/pytorch/text-classification/run_glue.py \
>         --model_name_or_path=distilbert-base-uncased \
>         --task_name=MNLI \
>         --do_train=true \
>         --num_train_epochs=1 \
>         --max_seq_length=128 \
>         --learning_rate=3e-5 \
>         --overwrite_output_dir=true \
>         --save_steps=3000 \
>         --save_strategy=no --output_dir=/workspace/mnli

Yes. (Well, to be exact, it's this script: https://github.com/GoogleCloudPlatform/ml-testing-accelerators/blob/master/tests/pytorch/nightly/hf-glue.libsonnet#L41)

@amyeroberts amyeroberts merged commit 5aec50e into huggingface:main Dec 19, 2023
21 checks passed
staghado pushed a commit to staghado/transformers that referenced this pull request on Jan 15, 2024

…#27993)

* When save a model, make a copy to be moved to CPU, don't move the original model
* make deepcopy inside of _save_tpu
* Move to tpu without copy
@celiolarcher commented

Hi everyone, one question: is this fix supposed to work with XLA FSDP?

I'm having problems with the model.to('cpu') part before unwrapping the model (memory access errors), and if I unwrap it before applying .to('cpu'), just one shard of the model is saved.
