
When saving a model on TPU, make a copy to be moved to CPU #27993

Merged
3 commits merged into huggingface:main on Dec 19, 2023

Conversation

@qihqi (Contributor) commented Dec 13, 2023

What does this PR do?

When we save a model on TPU, we first move it to CPU because TPU tensors have no
storage. However, we should do that with a copy of the model, so that the original model
stays on TPU; otherwise model.to('cpu') modifies the model in place. The following
error is then raised when that model is used in a computation:

indices should be either on cpu or on the same device as the indexed tensor (XLA). When using XLA, the indexed tensor must be an XLA tensor.
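For illustration, here is a minimal sketch of the difference (assuming a working torch_xla install; the identifiers below are illustrative, not the Trainer's actual code):

```python
import copy

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(4, 2).to(device)

# Buggy pattern: nn.Module.to() moves parameters *in place*, so after this
# call the original model lives on CPU and later XLA indexing ops fail.
#   model.to('cpu')

# Fixed pattern (this PR's first revision): save a deep copy on CPU and
# leave the original model on the TPU.
cpu_model = copy.deepcopy(model).to('cpu')
torch.save(cpu_model.state_dict(), '/tmp/model.bin')
assert str(next(model.parameters()).device).startswith('xla')  # original untouched
```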

Tested by running this command on TPU v4-8:

python3 examples/pytorch/text-classification/run_glue.py \
        --model_name_or_path=distilbert-base-uncased \
        --task_name=MNLI \
        --do_train=true \
        --num_train_epochs=1 \
        --max_seq_length=128 \
        --learning_rate=3e-5 \
        --overwrite_output_dir=true \
        --save_steps=3000 \
        --save_strategy=no --output_dir=/workspace/mnli

cc @muellerzr and @pacman100

@yeounoh (Contributor) commented Dec 13, 2023

@qihqi this is done inside _save_tpu (https://github.com/huggingface/transformers/pull/27799/files), or are we missing something?

@yeounoh (Contributor) commented Dec 13, 2023

> @qihqi this is done inside _save_tpu (https://github.com/huggingface/transformers/pull/27799/files), or are we missing something?

I see, we need to make a copy. Should we do that inside _save_tpu, or do it beforehand here?

@vanbasten23 commented

Thanks for the fix! Quick question: I used to see this error ("indices should be either on cpu or on the same device as the indexed tensor (XLA). When using XLA, the indexed tensor must be an XLA tensor.") and it seemed that some index ops failed, so I tried to fix the op instead of the model (and it didn't work out). How did you find out it was the model's problem?

@qihqi (Contributor, Author) commented Dec 14, 2023

> Thanks for the fix! Quick question: I used to see this error ("indices should be either on cpu or on the same device as the indexed tensor (XLA). When using XLA, the indexed tensor must be an XLA tensor.") and it seemed that some index ops failed, so I tried to fix the op instead of the model (and it didn't work out). How did you find out it was the model's problem?

Thanks! I also started with the assumption that the input was wrong, traced where the input comes from, and arrived here: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer.py#L2668, where the input is clearly being moved to the device.

Also, when I ran the script, the first few steps trained fine, so I started suspecting that something happened after training was done.

@qihqi (Contributor, Author) commented Dec 14, 2023

Hi @LysandreJik @ArthurZucker would you guys be able to help with a review? Thanks!

@LysandreJik (Member) commented

Pinging our TPU expert @muellerzr

@muellerzr (Contributor) left a comment


Very nice catch!

@amyeroberts amyeroberts requested review from amyeroberts and ArthurZucker and removed request for ArthurZucker December 15, 2023 14:45
@amyeroberts (Collaborator) left a comment


Thanks for adding this!

I'm not very familiar with TPUs. My only concern is that, if this behaves the same as a model on GPU, it requires enough space for two models on the device when copy.deepcopy(self.model) is called, i.e. there is no offloading.

@qihqi (Contributor, Author) commented Dec 15, 2023

> Thanks for adding this!
>
> I'm not very familiar with TPUs. My only concern is that, if this behaves the same as a model on GPU, it requires enough space for two models on the device when copy.deepcopy(self.model) is called, i.e. there is no offloading.

Good point! I have changed the code to first move the model to CPU and then move it back. PTAL.
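For reference, a minimal sketch of this move-then-move-back pattern (the helper name save_tpu_model is hypothetical; the real change lives inside Trainer._save_tpu):

```python
import torch
import torch_xla.core.xla_model as xm

def save_tpu_model(model: torch.nn.Module, path: str) -> None:
    """Sketch: save without holding two copies of the weights on the device."""
    device = next(model.parameters()).device
    model.to('cpu')                    # in-place move frees the device copy
    if xm.is_master_ordinal():         # write once in multi-process setups
        torch.save(model.state_dict(), path)
    model.to(device)                   # restore the original placement
```

This trades a small amount of extra host/device transfer time for not needing device memory for two full models at once.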

@muellerzr muellerzr requested review from amyeroberts and removed request for ArthurZucker December 15, 2023 20:18
@amyeroberts (Collaborator) left a comment


Thanks for adding and iterating!

@amyeroberts (Collaborator) commented

@qihqi Just to confirm that you were able to successfully run the script in the PR description with the most recent changes to the PR:

python3 examples/pytorch/text-classification/run_glue.py \
        --model_name_or_path=distilbert-base-uncased \
        --task_name=MNLI \
        --do_train=true \
        --num_train_epochs=1 \
        --max_seq_length=128 \
        --learning_rate=3e-5 \
        --overwrite_output_dir=true \
        --save_steps=3000 \
        --save_strategy=no --output_dir=/workspace/mnli

@qihqi (Contributor, Author) commented Dec 18, 2023

> @qihqi Just to confirm that you were able to successfully run the script in the PR description with the most recent changes to the PR:
>
> python3 examples/pytorch/text-classification/run_glue.py \
>         --model_name_or_path=distilbert-base-uncased \
>         --task_name=MNLI \
>         --do_train=true \
>         --num_train_epochs=1 \
>         --max_seq_length=128 \
>         --learning_rate=3e-5 \
>         --overwrite_output_dir=true \
>         --save_steps=3000 \
>         --save_strategy=no --output_dir=/workspace/mnli

Yes. (Well, to be exact, it's this script: https://github.com/GoogleCloudPlatform/ml-testing-accelerators/blob/master/tests/pytorch/nightly/hf-glue.libsonnet#L41)

@amyeroberts amyeroberts merged commit 5aec50e into huggingface:main Dec 19, 2023
21 checks passed
staghado pushed a commit to staghado/transformers that referenced this pull request on Jan 15, 2024

…#27993)

* When save a model, make a copy to be moved to CPU, don't move the original model
* make deepcopy inside of _save_tpu
* Move to tpu without copy
@celiolarcher commented

Hi everyone, one question: is this fix supposed to work with XLA FSDP?

I'm having problems with the model.to('cpu') part before unwrapping the model (memory access errors), and if I unwrap it before applying .to('cpu'), just one shard of the model is saved.
