[TPU] Support PyTorch/XLA FSDP via SPMD #28949
Conversation
Can HF folks point me to how to add a test case here, and also how to update the documentation?
LGTM overall! We might want to add a small test; it can be done in a follow-up PR.
Pinging @muellerzr for a second look!
import torch_xla.distributed.spmd as xs
import torch_xla.runtime as xr
I'm not a big fan of super short names, but it seems common in Trainer!
Tests should be added in the
As @ArthurZucker hinted at, we no longer handle things like this in the Trainer directly. I would rather see this code live in Accelerate, which Trainer can then pick up automatically since it relies on it for preparation, especially as this deals with the dataloaders. Would that be possible, please? :)
if self.is_fsdp_xla_v2_enabled:
    from torch_xla.experimental.spmd_fully_sharded_data_parallel import (
        SpmdFullyShardedDataParallel as FSDPv2,
    )
Could we make this easier by importing FSDPv2 as FSDP instead?
May I ask what the benefit of doing so would be?
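For illustration, the aliasing being suggested would look roughly like the sketch below. The helper name, the v1 import path, and the kwargs are assumptions about the surrounding Trainer code, not the merged diff:

def wrap_model(model, fsdp_xla_v2_enabled, fsdp_kwargs):
    # Sketch of the suggestion: alias both wrappers to the same name so the call site is shared.
    if fsdp_xla_v2_enabled:
        from torch_xla.experimental.spmd_fully_sharded_data_parallel import (
            SpmdFullyShardedDataParallel as FSDP,
        )
    else:
        from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP
    # The constructor arguments still differ between v1 and v2 (e.g. a v2-only shard_output
    # callback), which is presumably the caveat behind the question about the benefit.
    return FSDP(model, **fsdp_kwargs)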
raise ValueError("Something went wrong, the output of the model shouldn't be `None`") | ||
xs.mark_sharding(real_output, mesh, ("fsdp", None, None)) | ||
|
self.model = model = FSDPv2(
And then leave the check down here to decide what to do.
shard_output is not used by FSDPv1. Shouldn't we guard that with the flag too?
Can you elaborate a bit more? I can move the
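For reference, a rough sketch of the callback under discussion, assuming shard_output is the v2-only hook passed to the FSDPv2 constructor; the unwrapping of the model output here is illustrative, not the exact Trainer code:

import torch
import torch_xla.distributed.spmd as xs

def shard_output(output, mesh):
    # Find the tensor to shard; model outputs may be a tensor, a tuple, or an object with .logits.
    real_output = None
    if isinstance(output, torch.Tensor):
        real_output = output
    elif isinstance(output, tuple):
        real_output = output[0]
    elif hasattr(output, "logits"):
        real_output = output.logits
    if real_output is None:
        raise ValueError("Something went wrong, the output of the model shouldn't be `None`")
    # Shard the batch dimension along the "fsdp" mesh axis; leave the other dims replicated.
    xs.mark_sharding(real_output, mesh, ("fsdp", None, None))

Since FSDPv1 does not accept this callback, the idea would be to pass shard_output only on the path where is_fsdp_xla_v2_enabled is set.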
Speaking of adding tests, what should I test? Do you have TPU CI?
# PyTorch/XLA relies on the data loader to insert the mark_step for
# each step. Since we are breaking the loop early, we need to manually
# insert the mark_step here.
if is_torch_tpu_available():
I fixed a bug here. cc @ArthurZucker @jonb377
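For context, a minimal sketch of the early-exit pattern the fix concerns; the helper name and loop are illustrative, not the Trainer code itself:

import torch_xla.core.xla_model as xm

def take_first_batches(dataloader, num_batches, on_tpu):
    # PyTorch/XLA relies on the parallel loader to call mark_step once per iteration.
    # When we break out of the loop early, that mark_step never runs, so we issue it
    # manually to make sure the pending XLA operations are actually executed.
    batches = []
    for batch in dataloader:
        batches.append(batch)
        if len(batches) >= num_batches:
            if on_tpu:
                xm.mark_step()
            break
    return batches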
The test failures don't seem to be related. I tried rebasing as well.
Thanks @ArthurZucker and @muellerzr for approving the change.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
It's all green. Can HF folks help with landing the PR? Appreciate it.
I can merge :) Thanks for adding this support @alanwaketan!
* Initial commit
* Add guards for the global mesh
* Address more comments
* Move the dataloader into integrations/tpu.py
* Fix linters
* Make karg more explicitly
* Remove the move device logic
* Fix the CI
* Fix linters
* Re-enable checkpointing
What does this PR do?
Summary:
This is the first attempt to enable FSDP via SPMD (FSDPv2) for PyTorch/XLA models.
More information about FSDPv2 can be found here:
Besides the initial implementation of FSDPv2 in r2.2, this change also requires the following changes in PyTorch/XLA:
Therefore, it will only be compatible with the nightly builds.
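For orientation, the SPMD primitives this integration builds on look roughly like the following. This is a sketch of the nightly torch_xla API as I understand it, not code from this PR:

import numpy as np
import torch_xla.distributed.spmd as xs
import torch_xla.runtime as xr

# Put the runtime into SPMD mode and build a mesh with an "fsdp" axis over all devices.
xr.use_spmd()
num_devices = xr.global_runtime_device_count()
device_ids = np.arange(num_devices)
mesh = xs.Mesh(device_ids, (num_devices, 1), ("fsdp", "tensor"))
xs.set_global_mesh(mesh)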
Example use cases:
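The PR's own example invocations are not reproduced here; a hypothetical way to opt in from user code might look like this, where the fsdp_config keys (in particular "xla_fsdp_v2") are assumptions about this PR's configuration surface:

from transformers import TrainingArguments

# Hypothetical configuration; the exact keys are assumed, not confirmed by this PR text.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    fsdp="full_shard",
    fsdp_config={
        "xla": True,               # route FSDP through PyTorch/XLA
        "xla_fsdp_v2": True,       # assumed flag enabling the SPMD-based FSDPv2 path
        "xla_fsdp_grad_ckpt": True,
    },
)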
Before submitting
* Did you read the contributor guideline, Pull Request section?
* Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
* Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker @younesbelkada