[FSDP] Zero 3 Optimization Support #4903
Conversation
nice, pretty smooth implementation. i might recommend just axing support for fairscale but you're the boss
parlai/core/torch_generator_agent.py
Outdated
@@ -516,6 +516,16 @@ def __init__(self, opt: Opt, shared=None):
        else:
            # this is not a shared instance of this class, so do full init
            self.criterion = self.build_criterion()

    def load_init_model() -> Dict[str, Any]:
how is this used?
ahh it's not, that's an artifact. will delete
-        self.train_loop = single_train.TrainLoop(opt)
-        return self.train_loop.train()
+        self.train_loop = fsdp_utils.JoinableTrainLoop(opt)
+        with fsdp_utils.fsdp_join(self.train_loop):
do you need to do this for distributed_eval too?
yes, forgot about that. also multiprocessing eval
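For reference, the training-path pattern from the diff above, plus a sketch of how the same join wrapping could be mirrored for the distributed/multiprocessing eval scripts. JoinableEvalLoop and the eval() call are hypothetical names used only for illustration; the actual eval change may look different.

# training path, as in this PR's diff:
self.train_loop = fsdp_utils.JoinableTrainLoop(opt)
with fsdp_utils.fsdp_join(self.train_loop):
    return self.train_loop.train()

# hypothetical analogue for the eval scripts (illustrative only):
self.eval_loop = fsdp_utils.JoinableEvalLoop(opt)
with fsdp_utils.fsdp_join(self.eval_loop):
    return self.eval_loop.eval()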
parlai/utils/fsdp.py
Outdated
from fairscale.nn.wrap.auto_wrap import wrap, enable_wrap
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
why keep around fairscale support?
so getting rid of fairscale would force anyone who wants to use distributed training to be on pytorch >=1.12. Is that a reasonable ask, you think?
That's your call but pytorch 1.13 is out (and I'm using it successfully)
i've removed fairscale FSDP
Note: There is currently a bug in pytorch 1.13 that doesn't allow specifying your own pickle module for loading, which is breaking some tests. For now, requirements will specify the latest 1.12 version.
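For context, a minimal sketch of what swapping the fairscale imports in parlai/utils/fsdp.py for the native PyTorch implementation might look like; the exact imports and version guard in the final patch may differ.

# native FSDP, available in torch >= 1.12, replacing the fairscale imports above
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.wrap import enable_wrap, wrap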
Patch description
This PR adds support for the zero3 FSDP optimization, specified via --ddp-backend zero3. Zero3 shards not only the optimizer state and gradients but also the model weights. This reduces memory pressure and is especially useful for larger models. Note that it can increase latency due to the added communication cost, so for smaller models there may be a slight speed hit compared to zero2 (and to model parallel, of course).

Addresses #3753
NOTE: This requires pytorch >= 1.12 for use
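For illustration only, a rough sketch of how the two backends could map onto native FSDP sharding strategies; this mapping is an assumption about the implementation, not a quote from the patch.

from torch.distributed.fsdp import ShardingStrategy

def sharding_strategy_for(ddp_backend: str) -> ShardingStrategy:
    # hypothetical helper: zero2 shards optimizer state and gradients,
    # while zero3 additionally shards the model parameters
    if ddp_backend == 'zero3':
        return ShardingStrategy.FULL_SHARD
    return ShardingStrategy.SHARD_GRAD_OP

Example invocation (the task and model flags are placeholders):

parlai multiprocessing_train -t convai2 -m bart --ddp-backend zero3 --skip-generation False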
Testing steps
All runs used --skip-generation False during training. The main conclusions from each of the models below are:
BART-Large (400M)
T5-Large (770M)
Reddit 2.7B (base of BlenderBot)
GPT2-XL (1.5B)
R2C2 2.7B