
[WIP] Support FSDP #358

Draft · mehdidc wants to merge 37 commits into main from fsdp
Conversation

@mehdidc (Contributor) commented Jan 17, 2023

This PR adds FSDP (https://pytorch.org/docs/stable/fsdp.html) support for training large models that cannot fit in memory.

The code already works, but it still needs to be improved, so this is still a draft.

Some scaling plots (samples per second per GPU), measured on JUWELS Booster.

[scaling plot: G-14]

[scaling plot: X-14 (15B visual, 5B text)]

I also tried G-14 as the visual encoder together with a pre-trained T5-XXL as the text encoder.

Repeating some remarks and possible improvements discussed earlier on Discord:

  • I see some hanging issues starting from a large number of nodes (256 nodes / 1024 GPUs on JUWELS Booster): not a single iteration completes. I don't see anything special in the NCCL debugging info, except that it does a lot of all_gather, which is expected from FSDP
  • CPU offloading and gradient checkpointing are supported
  • each time encode_image, encode_text, or logit_scale was accessed without going through the forward function (which happens when clipping the logit scale, or at evaluation), an exception was raised (see [FSDP] caffe2 error in forward method when using fsdp pytorch/pytorch#82461 for reference). The workaround I found is to modify the forward function so that it can encode both text and image (as currently done), text only, or image only, or be used for clipping the logit scale. It would be better if we find a cleaner solution. The solution proposed in the pytorch issue above is to wrap the modules (here the text and image encoders) using FSDP, but then we need to change some internals, as part of the text encoder cannot be wrapped: it is an nn.Parameter, and FSDP needs an nn.Module. We could use CustomTextCLIP to wrap the text encoder in its entirety, as proposed by @rwightman, but then we need to deal with logit_scale.
  • The list of layers to FSDP-wrap is important, as it affects the peak memory (documented here: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html#id2), and the layer names are model dependent: e.g. in T5 I am FSDP-wrapping T5Block, and in the CLIP class I am wrapping ResidualAttentionBlock. So we need a way to parametrize this; for the moment the names are hardcoded (see the sketch just below this list). If we just use the default auto-wrap policy of FSDP, we get OOM.
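A minimal sketch of how such a wrap policy can be parametrized by layer class, using PyTorch's transformer_auto_wrap_policy helper; the open_clip import and model construction are illustrative, not this PR's exact code:

```python
import functools

import open_clip
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# ResidualAttentionBlock is the transformer block used in open_clip's towers;
# for a T5 text tower one would add T5Block to the set as well.
from open_clip.transformer import ResidualAttentionBlock

# assumes torch.distributed is already initialized (e.g. via torchrun)
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")

# Each ResidualAttentionBlock becomes its own FSDP unit, which bounds peak
# memory to roughly one unsharded block at a time instead of a whole tower.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={ResidualAttentionBlock},
)
model = FSDP(model, auto_wrap_policy=wrap_policy)
```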

@mehdidc marked this pull request as draft January 17, 2023 15:27
@orchidmajumder commented:
One thing that will break in this implementation is the ability to set a separate weight decay for LN parameters or biases, e.g. here:

```python
exclude = lambda n, p: p.ndim < 2 or "bn" in n or "ln" in n or "bias" in n or 'logit_scale' in n
```

Because FSDP wraps the VisionTransformer or TextTransformer into a single FSDP block, these parameter names will not be retained and the filter will fail silently. You can print the parameter names after the model is wrapped in FSDP to verify this.

To make sure FSDP retains the original names, you will need to pass an additional argument to the FSDP constructor, use_orig_params=True. See my discussion on the PyTorch forum here: https://discuss.pytorch.org/t/setting-different-weight-decay-values-for-parameters-within-one-fsdp-unit/169862/2. This feature needs PyTorch nightly.
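A minimal sketch of that fix, reusing the wrap_policy from the earlier sketch (requires a nightly build at the time of writing):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# use_orig_params=True keeps the original (unflattened) parameters visible,
# so their names and per-parameter shapes survive wrapping.
model = FSDP(model, auto_wrap_policy=wrap_policy, use_orig_params=True)

# Without use_orig_params this loop only sees 1-D `flat_param` entries;
# with it, the original names and shapes are retained.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
```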

@rwightman (Collaborator) commented Jan 26, 2023

@orchidmajumder good point, I believe that arg is also required if we want to use torch.compile with FSDP, at least in its current state

@rom1504 (Collaborator) commented Jan 30, 2023

Can you rebase on master?

@mehdidc (Contributor, Author) commented Jan 31, 2023

Thanks @orchidmajumder @rwightman, will look into that! @rom1504 just rebased.

@mehdidc (Contributor, Author) commented Feb 2, 2023

Update: the layer names to FSDP-wrap are no longer hardcoded; they can now be provided on the CLI, with defaults that already work with the models we have.

@mehdidc (Contributor, Author) commented Feb 4, 2023

Update: following this thread (huggingface/accelerate#807), full/partial locking now works; a sketch of the approach is below. Currently getting some throughput numbers with mt5-xxl-ViT-G-14.
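A minimal sketch of the approach from that thread, with illustrative names (`text_encoder.blocks` is not this PR's exact API): parameters are frozen before FSDP wrapping, so each FSDP unit ends up either fully trainable or fully frozen.

```python
# Freeze everything except the last 5 transformer blocks of the text tower.
blocks = list(text_encoder.blocks)  # illustrative attribute name
for block in blocks[:-5]:
    for p in block.parameters():
        p.requires_grad = False

# Wrap only after freezing: FSDP groups parameters into flat buffers at wrap
# time, and mixing frozen and trainable parameters inside one unit is what
# causes trouble (see the accelerate thread above).
model = FSDP(model, auto_wrap_policy=wrap_policy)
```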

@mehdidc (Contributor, Author) commented Feb 9, 2023

Update: I mentioned earlier that training was hanging at large node counts (e.g., 256 on JUWELS Booster). After checking lower node counts, it seems that the start-up phase (before the first "INFO | Train Epoch" line is displayed) is long and proportional to the number of nodes, which is problematic: for 128 nodes the start-up phase takes 24 minutes, and for 64 nodes it takes 11 minutes. I will open an issue on PyTorch. So the 256-node run was probably not actually hanging, I just did not run it for long enough; but that is a lot of time if it really takes ~48 minutes.

[GPU usage plots: starting_up, starting_up_2]

This is the GPU usage for a 128-node run of mt5-xxl-ViT-G-14: low GPU usage for the first 24 minutes, then usage rises above 99%, which coincides with the first "INFO | Train Epoch" message in the logs.

@nkflash commented Feb 16, 2023

Hi @mehdidc, based on your code I tried the ViT-e-14 model, and OpenCLIP hangs after the first epoch step with FSDP enabled. Do you see the same issue?

@mehdidc (Contributor, Author) commented Feb 17, 2023

Hey @nkflash, thanks, I actually noticed that as well, even with smaller models. I am on it.

EDIT: found a fix, will push soon

@mehdidc (Contributor, Author) commented Feb 18, 2023

@nkflash pushed, could you please try again? I can confirm that it worked for me

@mehdidc (Contributor, Author) commented Feb 18, 2023

Thanks @orchidmajumder, use_orig_params is working as expected, so with PyTorch nightly we can already use it. If we also want to support the current stable PyTorch (1.13), wrapping the layer norms in their own FSDP units using the option I added (--fsdp-layers-to-wrap) would also work, but it does not handle other cases, e.g. biases from MLP layers, which we would also need to wrap separately. So I am not sure we can support the current stable release (1.13) without more complications in the code. For now, I think we just need to document this for the user (unless we find a better solution): the closest we can get to the "correct" behavior on 1.13 is to FSDP-wrap the layer norms (a sketch follows below), but in that case biases from MLPs will still be decayed; otherwise one needs PyTorch nightly or the next stable version.
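A minimal sketch of that 1.13 workaround; transformer_auto_wrap_policy wraps any module whose class is in the given set, so nn.LayerNorm can simply be added alongside the block class (imports as in the earlier sketches):

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from open_clip.transformer import ResidualAttentionBlock

# Each LayerNorm gets its own FSDP unit, so its flat param's name still
# contains "ln" and a name-based no-decay filter can catch it. MLP biases
# remain flattened into their block's flat param, hence still decayed.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={ResidualAttentionBlock, nn.LayerNorm},
)
```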

The other thing that needs to be changed is:

```python
exclude = lambda n, p: p.ndim < 2 or "bn" in n or "ln" in n or "bias" in n or 'logit_scale' in n
```

Since FSDP flattens everything, p.ndim is always < 2, so everything would be excluded in the current code, which means nothing would be weight-decayed.

I found out that, for ViTs at least, the only additional case that p.ndim < 2 covers is visual.class_embedding:

```python
self.class_embedding = nn.Parameter(scale * torch.randn(width))
```

The rest is covered by the other clauses. Or is this supposed to cover something else? @rwightman @rom1504 @mitchellnw @gabrielilharco
What about making the exclusions parametrizable, e.g. with regexps? Perhaps to be more explicit about which parameters to decay.

@mitchellnw (Contributor) commented:

The p.ndim < 2 check should also cover logit_scale.

@mehdidc (Contributor, Author) commented Feb 18, 2023

Yes, I was thinking of that as well, but saw that there is already 'logit_scale' in n in exclude.

@rwightman (Collaborator) commented:

@mehdidc The position, token, and class embeddings are typically not decayed either, but it looks like that was never done in OpenCLIP, hrmm. I have no_weight_decay methods in timm that return lists of names to exclude from decay, to cover the ndim >= 2 cases like position embeddings, etc.
https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/vision_transformer.py#L506
https://github.com/rwightman/pytorch-image-models/blob/main/timm/models/maxxvit.py#L1229

I feel that name-based decay by itself is error-prone and won't generalize well to other models (like timm vision towers); the unflattened dim is the strongest signal (but sometimes you need to use names for some layers, like embeddings, etc.).

I feel the best approach would be to build a set of names to not decay (from both shape and name) before wrapping in FSDP, while that info is still available, i.e. move the current code a bit earlier. Something like the sketch below.
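A minimal sketch of that ordering, assuming use_orig_params wrapping as above; the decay value and the wrapper-prefix stripping are illustrative:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# 1) Collect no-decay names from shape *and* name while the model is still
#    unwrapped, i.e. while unflattened shapes and original names exist.
def no_decay_names(model):
    return {
        n for n, p in model.named_parameters()
        if p.ndim < 2 or "bn" in n or "ln" in n
        or "bias" in n or "logit_scale" in n
    }

no_decay = no_decay_names(model)  # BEFORE FSDP(...)
model = FSDP(model, auto_wrap_policy=wrap_policy, use_orig_params=True)

# 2) Build the optimizer groups after wrapping. FSDP inserts a wrapper
#    prefix into parameter names, so strip it before matching the set.
def clean(name):
    return name.replace("_fsdp_wrapped_module.", "")

named = list(model.named_parameters())
optimizer = torch.optim.AdamW(
    [
        {"params": [p for n, p in named if clean(n) in no_decay],
         "weight_decay": 0.0},
        {"params": [p for n, p in named if clean(n) not in no_decay],
         "weight_decay": 0.2},  # illustrative value
    ],
    lr=1e-3,
)
```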

@rwightman (Collaborator) commented:

On the other topic, I feel it's fine to support nightlies only; I'm already exclusively using nightlies to train the convnext models because it's the only way to get decent bfloat16 support for convolutions. I'm going to add torch.compile soon to see how that works; nn.MHA is quite a bit faster on nightlies (it has a fused kernel).

@mehdidc (Contributor, Author) commented Feb 19, 2023

@rwightman Thanks for the suggestion, I moved the code a bit earlier; it is fixed now.

@mehdidc force-pushed the fsdp branch 3 times, most recently from 1347816 to b118c14 on February 19, 2023 10:18
@mehdidc (Contributor, Author) commented Feb 19, 2023

Update: @rwightman @rom1504 @mitchellnw @gabrielilharco @JeniaJitsev just for info: regarding the start-up phase I mentioned earlier (#358 (comment)), I found out that it is proportional not only to the number of nodes but also to model size, but I found a fix. Read below if you want more info.

For example, with 256 nodes on JUWELS Booster it took 13 minutes for ViT-B/32, 16 minutes for ViT-L/14, and 28 minutes for ViT-g/14; that is a lot of wasted time.

After checking the trace, I saw that it was not hanging, stuff was happening, and I found that FSDP does something special on the first forward pass (https://github.com/pytorch/pytorch/blob/85e0fd0280948a342a916429448fed2486e82aa5/torch/distributed/fsdp/_exec_order_utils.py#L210). After profiling, I found that there are two for loops (https://github.com/pytorch/pytorch/blob/85e0fd0280948a342a916429448fed2486e82aa5/torch/distributed/fsdp/_exec_order_utils.py#L235) which take about 12 seconds each (for ViT-B/32), and this is done for each FSDP unit (the number of FSDP units is proportional to model size, as we FSDP-wrap the residual blocks); multiply it all out and you get the explanation for the start-up times. The loops basically iterate over all pairs of ranks, about 1M iterations in total for the two loops, which shouldn't be slow by itself; the problem is a repeated element access to a tensor that lives on the GPU (https://github.com/pytorch/pytorch/blob/85e0fd0280948a342a916429448fed2486e82aa5/torch/distributed/fsdp/_exec_order_utils.py#L237), which slows things down. I will open an issue/PR; a simple .cpu() before the for loops solves the problem, and it is then a matter of seconds. A sketch of the pattern is below.
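For illustration, a minimal standalone sketch of the pattern behind the fix (names are illustrative, not the actual FSDP internals): indexing a CUDA tensor element by element inside a Python loop forces a device synchronization per access, while a single up-front .cpu() copy turns each access into a cheap host read.

```python
import torch

world_size = 1024
# Stand-in for the all-gathered bookkeeping tensor in FSDP's first-iteration
# exec-order check; in the real code it lives on the GPU.
world_indices = torch.zeros(world_size, world_size, device="cuda")

world_indices = world_indices.cpu()  # one device->host transfer
mismatches = 0
for rank1 in range(world_size):
    for rank2 in range(world_size):
        # With the tensor on CPU, each element read no longer triggers a
        # GPU synchronization.
        if world_indices[rank1, rank2].item() != 0:
            mismatches += 1
```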

@mehdidc changed the title from "Support FSDP" to "[WIP] Support FSDP" Feb 19, 2023
@orchidmajumder commented:

I have also observed the delay with FSDP on AWS clusters and actually thought FSDP hangs above a certain number of nodes, so I didn't pursue it further. Thanks for the amazing deep-dive @mehdidc.

@nkflash commented Feb 20, 2023

> @nkflash pushed, could you please try again? I can confirm that it worked for me

I checked out the head of the branch; it works well now.

@mehdidc (Contributor, Author) commented Mar 6, 2023

Update: now that the problem at large node counts is solved, here are updated scaling plots, up to 1024 GPUs:

[scaling plot: G-14]

I also tested freezing a subset of layers, with MT5-XXL as the text encoder (last 5 blocks trainable, rest frozen) and G-14 as the visual encoder (last block trainable, rest frozen), with patch dropout 0.5:

[scaling plot: mt5-xxl-ViT-G-14]

@mehdidc (Contributor, Author) commented Mar 12, 2023

Update: the first model fully trained with FSDP is finished. I started with a ViT-B/32 on LAION-400M, 32 epochs (96 GPUs, local batch size of 896, global batch size of 86016, lr of 0.001); zero-shot accuracy on ImageNet is 63.6%, with ~90K samples/s throughput. Training was done using pytorch-nightly (torch-2.0.0.dev20230218+cu117).

{"dataset": "wds/imagenet1k", "model": "ViT-B-32", "pretrained": "epoch_32.pt", "task": "zeroshot_classification", "metrics": {"acc1": 0.63606, "acc5": 0.87912, "mean_per_class_recall": 0.6360399999999999}, "language": "en"}


It's similar to what we get in https://arxiv.org/pdf/2212.07143.pdf (Table 13).

Commit messages (truncated by the page):

- …eter names to avoid erroneous parameter decay, and decay params by constructing a set of parameter names to decay before FSDP wrapping (thanks to @rwightman)
- …ict and we shard the optim state dict after loading
- …y from pytorch nightly
- fix grad checkpointing offloading to be compatible with pytorch nightly
- use sync_module_states
- Only import FSDP modules if possible to avoid import error