Enable AMP for xla:gpu device in trainer class #15022
Conversation
Force-pushed from 60ddcda to b49ef4a (Compare)
Oh, interesting! Thanks for your contribution, pinging @sgugger on the issue.
I'm not entirely sure about this PR in the sense that PyTorch XLA support is mainly for TPU, and I don't know if traditional mixed precision training with the gradient scaler will work on TPUs.
So we should probably split the test to detect if we have GPUs available or TPUs. Some of the logic will stay common between the two, but the mixed precision part might only work for XLA GPUs?
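For illustration, a minimal sketch of one way to tell the two cases apart under torch_xla (this assumes torch_xla is installed; `xla_device_hw` is an existing torch_xla helper, but the wrapper function below is hypothetical, not code from this PR):

```python
import torch_xla.core.xla_model as xm


def xla_backend_is_gpu() -> bool:
    # xla_device_hw reports the hardware behind the current XLA device,
    # e.g. "TPU", "GPU" or "CPU", so the GPU and TPU paths can be split here.
    device = xm.xla_device()
    return xm.xla_device_hw(device) == "GPU"
```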
```python
if is_torch_tpu_available():
    xm.mark_step()
```
This part is done by the dataloader (which is wrapped in a `ParallelLoader`), so it shouldn't be here.
This is intentional, since `loss` will be materialized in `self._nested_gather(loss.repeat(batch_size))`, and adding a mark_step here can significantly improve the speed. For example, the evaluation time of `bert-base-uncased` using `run_mlm.py` is reduced from 32.53s to 18.73s by adding this mark_step.
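For context, a simplified sketch of the pattern being discussed (illustrative only; the function name is hypothetical and this is not the exact Trainer evaluation-loop code, though `Trainer._nested_gather` and `xm.mark_step` are real):

```python
import torch
import torch_xla.core.xla_model as xm


def gather_eval_loss(trainer, loss: torch.Tensor, batch_size: int) -> torch.Tensor:
    # Flush the pending XLA graph *before* the loss is materialized below.
    # Without this mark_step, the materialization in _nested_gather forces a
    # much larger graph execution on every evaluation step.
    xm.mark_step()
    return trainer._nested_gather(loss.repeat(batch_size))
```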
src/transformers/training_args.py (Outdated)

```diff
@@ -811,7 +811,7 @@ def __post_init__(self):
             raise ValueError("sharded_ddp is not supported with bf16")
         if (
             is_torch_available()
-            and self.device.type != "cuda"
+            and (self.device.type != "cuda" and self.device.type != "xla")
```
Will this be false on TPU? The test is there for that purpose since mixed precision training does not work on TPU.
You are right that this check won't filter out TPU. We can change it to `not (self.device.type == "xla" and "GPU_NUM_DEVICES" in os.environ)`.
Right, XLA:TPU does not support AMP; only XLA:GPU supports it.
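For concreteness, a standalone sketch of the check being agreed on here (the helper name is illustrative; the real change lives inside `TrainingArguments.__post_init__`):

```python
import os


def amp_supported(device_type: str) -> bool:
    """AMP is allowed on native CUDA devices, and on XLA devices only when
    they are backed by GPUs (signalled, as in the discussion above, by the
    GPU_NUM_DEVICES environment variable)."""
    if device_type == "cuda":
        return True
    return device_type == "xla" and "GPU_NUM_DEVICES" in os.environ
```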
Force-pushed from 356f161 to 0e37806 (Compare)
So, as I said in my previous comment, could you add a new test?
@sgugger Maybe I'm missing something. Could you elaborate why the changes will make the Trainer stop working on TPU? Here is the command I tested:

```bash
python run_mlm.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --overwrite_output_dir true \
    --output_dir /tmp/test-mlm \
    --per_device_train_batch_size 10 \
    --do_eval \
    --do_train
```

With the master branch:

```
WARNING:root:TPU has started up successfully with version pytorch-1.9
WARNING:__main__:Process rank: -1, device: xla:1, n_gpu: 0distributed training: False, 16-bits training: False
...
***** train metrics *****
  epoch                    =        3.0
  train_loss               =     1.7568
  train_runtime            = 0:12:23.47
  train_samples            =       4627
  train_samples_per_second =      18.67
  train_steps_per_second   =      1.868
```

With this PR:

```
WARNING:root:TPU has started up successfully with version pytorch-1.9
WARNING:__main__:Process rank: -1, device: xla:1, n_gpu: 0distributed training: False, 16-bits training: False
...
***** train metrics *****
  epoch                    =        3.0
  train_loss               =     1.7577
  train_runtime            = 0:10:19.70
  train_samples            =       4627
  train_samples_per_second =     22.399
  train_steps_per_second   =      2.241
```
Ah yes, you're right. Thanks for testing!
What does this PR do?
This PR enables AMP in the Trainer class for the xla:gpu device.
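For context, a generic sketch of the autocast/GradScaler pattern that "AMP" refers to here (placeholder model/optimizer/batch names; this is the standard `torch.cuda.amp` recipe, not the Trainer's exact code, and running it against an xla:gpu device is precisely what this PR is about):

```python
import torch


def amp_training_step(model, optimizer, scaler, batch):
    # Standard mixed-precision step: forward pass under autocast, scale the
    # loss before backward, then step the optimizer through the GradScaler.
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()


# Typical setup (sketch): scaler = torch.cuda.amp.GradScaler()
```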
Discussion
It looks like the torch_xla support in the Trainer class is primarily for the xla:tpu device.
I found the following features may be useful but not essential, and I can include them in this PR if necessary:
- Renaming `tpu` to `xla` in the codebase.
- `GPU_NUM_DEVICES` currently has to be set manually when using the xla:gpu device. It may be useful to set a default value for it when torch_xla and CUDA devices are available (see the sketch below).
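A hedged sketch of what such a default could look like (the helper name, its placement, and the use of `torch.cuda.device_count()` are assumptions, not part of this PR):

```python
import os

import torch


def maybe_default_gpu_num_devices() -> None:
    # Hypothetical helper: if CUDA devices are visible and the user has not
    # set GPU_NUM_DEVICES for torch_xla's GPU backend, default it to the
    # number of local CUDA devices.
    if torch.cuda.is_available() and "GPU_NUM_DEVICES" not in os.environ:
        os.environ["GPU_NUM_DEVICES"] = str(torch.cuda.device_count())
```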