How do I freeze weights when using FSDP? #807
cc @pacman100
Hello @antopost, the "NO_WRAP" policy doesn't save any CUDA memory, because all parameters of the entire model are gathered during the forward pass instead of only those of a few layers. More details here: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
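For contrast, here is a minimal sketch (the toy `nn.Sequential` model and the `min_num_params` threshold are chosen only for illustration, and an already-initialized process group, e.g. via `torchrun`/`accelerate launch`, is assumed) of the difference between a single FSDP unit and an auto-wrap policy that splits the model into several units:

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def build_model():
    return nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])


# NO_WRAP-like behaviour: the whole model becomes a single FSDP unit, so all
# parameters are all-gathered for every forward pass and no memory is saved.
single_unit = FSDP(build_model())

# With an auto-wrap policy the model is split into several FSDP units; only
# the unit currently executing is gathered, which is what lowers peak memory.
many_units = FSDP(
    build_model(),
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=1_000_000
    ),
)
```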
Next, regarding freezing weights when using FSDP: the weights of FSDP units are flattened, and each unit can span multiple layers.
Freezing certain weights therefore requires manually wrapping the model, with each frozen layer wrapped into a separate FSDP unit, so that all parameters of a given FSDP unit share the same requires_grad. Please go through https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/ for an example of manual wrapping. You can freeze the model layers, then do the manual wrapping into FSDP units, and pass the model to `accelerator.prepare`.
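A minimal sketch of that recipe (the toy `Model`, the layer sizes, and the choice to freeze the whole `backbone` are assumptions for illustration; launching under `accelerate launch` with an FSDP config, which also initializes the process group, is assumed as well, and the exact interaction with `prepare` can vary by accelerate version):

```python
import torch
import torch.nn as nn
from accelerate import Accelerator
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class Model(nn.Module):
    """Toy model: a backbone to be frozen plus a trainable head."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)
        )
        self.head = nn.Linear(512, 10)

    def forward(self, x):
        return self.head(self.backbone(x))


model = Model()

# 1) Freeze the layers that should not be trained.
for param in model.backbone.parameters():
    param.requires_grad = False

# 2) Manually wrap the frozen part into its own FSDP unit, so that the
#    flattened parameter of that unit contains only requires_grad=False
#    weights.
model.backbone = FSDP(model.backbone)

# 3) Let Accelerate apply the outer FSDP wrapping to the rest of the model.
accelerator = Accelerator()
model = accelerator.prepare(model)

# 4) With FSDP it is safer to build the optimizer after the model has been
#    wrapped, from the parameters that are still trainable.
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad])
optimizer = accelerator.prepare(optimizer)
```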
Thanks for the clarification!
System Info
Information
Tasks
- An officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
Running into some issues when freezing weights during multi-GPU training with FSDP.
I've tried preparing my model both before and after freezing the weights, with different but equally disappointing results.
Preparing before: `freeze_layers` prints this to the console:

0 _fsdp_wrapped_module.flat_param --> freeze
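That single line can be reproduced with a hypothetical stand-in for `freeze_layers` (the real helper from the issue is not shown here, and the `trainable` name patterns are invented for illustration): once the model has gone through `accelerator.prepare`, FSDP has flattened the wrapped unit into one `flat_param`, so a name-based freezing loop only ever sees that single entry instead of individual layers.

```python
def freeze_layers(model, trainable=("head",)):
    # Freeze every parameter whose name does not match a trainable pattern.
    # On an FSDP-wrapped (NO_WRAP) model, named_parameters() exposes only
    # "_fsdp_wrapped_module.flat_param", so exactly one line is printed and
    # the whole flattened parameter is frozen.
    for i, (name, param) in enumerate(model.named_parameters()):
        if not any(pattern in name for pattern in trainable):
            param.requires_grad = False
            print(i, name, "--> freeze")
```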
Preparing after:
I get this error:
Any help would be much appreciated :)
Expected behavior
The specified model layers of each respective process should be set to `requires_grad=False`.