Lightning 2.1.0 no longer supports saving/loading FSDP checkpoints with PyTorch < 2.0 #18230
Comments
Hi @speediedan. We could, however, try to import from https://github.com/pytorch/pytorch/blob/v1.13.1/torch/distributed/fsdp/__init__.py and maybe that's enough to make it torch 1.13 compatible.
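For concreteness, a rough sketch of that conditional-import idea (an editor illustration, not code from the thread): StateDictType and FullStateDictConfig are exported from torch.distributed.fsdp on both 1.13 and 2.x, so only the 2.0-only names need a version guard.

```python
import torch
from packaging.version import Version

# These names exist in torch.distributed.fsdp on torch 1.13 as well
# (see the v1.13.1 __init__.py linked above):
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

_TORCH_GREATER_EQUAL_2_0 = Version(torch.__version__.split("+")[0]) >= Version("2.0.0")

if _TORCH_GREATER_EQUAL_2_0:
    # FullOptimStateDictConfig was only added in torch 2.0.
    from torch.distributed.fsdp import FullOptimStateDictConfig
```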
Yeah, I tinkered with conditionally using the old import locations, and I think we likely could backport this functionality to 1.x. That would involve a fair amount of custom code in the backport, though, and could prove a fairly ugly/brittle solution. As such, it may be worth considering the alternative of restricting this functionality on 1.x instead (e.g. keeping model-state save/load working but warning/erroring for the optimizer state).

Open to other thoughts and suggestions of course (of which yours are so often awesome!). What do you think?
Thanks for the suggestion. Keeping the loading of the model state compatible with 1.13 seems feasible, and warning/erroring for the optimizer state is probably the easiest for now. Would that work for you and finetuning-scheduler as well?
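A minimal sketch of what that warn/error compromise could look like (hypothetical helper, not the actual Lightning implementation): model state stays 1.13-compatible, while the optimizer-state path refuses to run on older torch.

```python
import torch
from packaging.version import Version

def _validate_fsdp_optim_state_support() -> None:
    # Hypothetical guard: the optimizer-state path relies on the 2.0-only
    # FullOptimStateDictConfig, so refuse it on torch < 2.0.
    if Version(torch.__version__.split("+")[0]) < Version("2.0.0"):
        raise NotImplementedError(
            "Saving/loading the FSDP optimizer state requires torch >= 2.0."
        )
```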
Absolutely, sounds great.

@speediedan Do you plan to work on this? We'd want to fix it before the next release to avoid breaking these checkpoints.

Not sure if I'll have the bandwidth in the next few days, and I wouldn't want to hold this up since I know it'll be important to ensure it's in 2.1. Certainly go ahead and implement. Thanks for checking!
Bug description
With the latest dev commit as of this writing (0aeeb60), Lightning imports do not allow saving/loading of FSDP checkpoints with PyTorch < 2.0.
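For illustration (an editor reconstruction, not the verbatim output from the report), the kind of failure this produces on torch 1.13:

```python
# Fails on torch < 2.0, since the name was only added in 2.0:
from torch.distributed.fsdp import FullOptimStateDictConfig
# ImportError: cannot import name 'FullOptimStateDictConfig'
# from 'torch.distributed.fsdp'
```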
Also note that both the save and load code paths use the state_dict_type context manager and attempt to import from the FSDP PyTorch 2.0 locations even with PyTorch < 2.0: https://github.com/Lightning-AI/lightning/blob/0aeeb60566cc0375df3cf1a4458592651f143717/src/lightning/fabric/strategies/fsdp.py#L792-L819
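For reference, a sketch of the 1.13-era usage of that context manager, assuming model is an FSDP-wrapped module; the 1.13 signature takes only a state_dict_config, while 2.0 adds an optim_state_dict_config parameter:

```python
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

# `model` is assumed to be an FSDP-wrapped module.
config = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, config):
    # Gathers the full, unsharded state dict (materialized on rank 0 only,
    # per the config above).
    full_state_dict = model.state_dict()
```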
Finally, I don't believe FullOptimStateDictConfig is defined in the FSDP 1.x API, so that may need to be worked around if support for 1.x FSDP continues.

I imagine the above challenges could be surmounted to continue providing support for saving/loading FSDP checkpoints with PyTorch < 2.0, but I wanted to ensure that was the intention. If deprecation of this FSDP functionality for PyTorch 1.x is expected, I'll go ahead and begin deprecating it in finetuning-scheduler.
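Should a workaround be attempted anyway, torch 1.13 exposes a consolidated optimizer state dict through different entry points than 2.0; a sketch (an editor illustration, with model and optimizer assumed defined):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Pre-2.0 equivalents of the 2.0 optim_state_dict APIs:
# full_optim_state_dict consolidates the optimizer state for saving, and
# shard_full_optim_state_dict re-shards a loaded full state dict for this rank.
full_osd = FSDP.full_optim_state_dict(model, optimizer)  # save path
sharded_osd = FSDP.shard_full_optim_state_dict(full_osd, model)  # load path
optimizer.load_state_dict(sharded_osd)
```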
Thanks again for all your invaluable contributions to the open-source ML ecosystem!
What version are you seeing the problem on?
master
How to reproduce the bug
Error messages and logs
Environment
More info
No response
cc @awaelchli @carmocca