Consolidate ZeRO state before checkpoint saving #2623
Comments
Thanks for the feature request @danieltudosiu! Yes, from the user side, as you suggested, one possible alternative solution could be to attach a specific handler for that:
    from ignite.engine import Events
    from ignite.handlers import Checkpoint

    to_save = {"zero": zero_optimizer}
    checkpoint = Checkpoint(to_save, ...)
    recipient_rank = 0
    # Consolidate first (on every rank), then run the checkpoint handler.
    trainer.add_event_handler(Events.EPOCH_COMPLETED, lambda _: zero_optimizer.consolidate_state_dict(to=recipient_rank))
    trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint)
Do you think we should hard-code this specific handling into ignite/ignite/handlers/checkpoint.py (lines 459 to 466 in e45846f)?
Checkpoint class will have new
Hi @vfdev-5, the whole thing boils down to a design decision: do you want the user to take care of this specific checkpointing logic, or do you want it to be seamlessly integrated into Ignite?
Be careful, because this is not the complete solution: consolidate_state_dict MUST be called on all ranks. That means all ranks need to go through the checkpoint-saving logic, so the checkpoint saver must be aware of the rank it is being called on so that the checkpoint is not saved on every rank. Besides that, yes, I would lean toward integrating the ZeRO consolidation logic into the Checkpoint class so that the user does not need to fuss with it.
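The constraint described above can be sketched without any distributed setup. This is a minimal, hypothetical illustration and not Ignite's implementation: `save_checkpoint_all_ranks` and `save_fn` are names invented here, and the optimizer is only assumed to expose `consolidate_state_dict(to=...)` and `state_dict()`, as in the snippet earlier in this thread.

```python
from typing import Any, Callable, Dict


def save_checkpoint_all_ranks(
    optimizer: Any,
    rank: int,
    save_fn: Callable[[Dict[str, Any]], None],
    recipient_rank: int = 0,
) -> bool:
    """Run on EVERY rank; only `recipient_rank` writes the checkpoint.

    Hypothetical helper: `optimizer` is assumed to provide
    `consolidate_state_dict(to=...)` and `state_dict()`.
    """
    # Collective step: every rank has to enter this call, otherwise the
    # ranks that did call it would block waiting for the others.
    optimizer.consolidate_state_dict(to=recipient_rank)

    if rank != recipient_rank:
        # This rank took part in the consolidation but does not save.
        return False

    # Only the recipient rank writes the consolidated state.
    save_fn({"zero": optimizer.state_dict()})
    return True
```

The return value only reports whether the current rank wrote the checkpoint; the important part is that the early return happens after the consolidation call, not before it.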
Hi @danieltudosiu, thanks for your pointers! Today, ignite/ignite/handlers/checkpoint.py (lines 176 to 185 in e45846f)
and internally
Sounds good to me as well. I'll send a PR and will ask you to check it. Hope it works for you :)
Hi @vfdev-5, thanks for moving so quickly. For my personal use, I already coded the Handler, so I am OK ;) I just wanted Ignite to be even more user-friendly <3 Cheers, Dan
Is your feature request related to a problem? Please describe.
When checkpoint saving occurs, there should be a check for whether the object that state_dict() is called on is a ZeroRedundancyOptimizer instance.
Describe the solution you'd like
Prior to the state_dict() call, a consolidate_state_dict() call should be issued. This call needs to be issued on all ranks and point to the same consolidating rank.
There are two possible designs here: either a Checkpoint handler is instantiated on all ranks and saves the checkpoint only on the designated rank, or a separate Handler runs before the Checkpoint handler in order to consolidate.
Special care must be taken with the PyTorch version, as the naming of the arguments of the ZeRO method has changed between recent versions.
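One way to cope with a renamed argument is to inspect the method signature at runtime. The sketch below is illustrative only: `consolidate_on` is a hypothetical helper, and the two keyword names it tries (`to` and `recipient_rank`) are assumptions that should be checked against the installed PyTorch version.

```python
import inspect
from typing import Any


def consolidate_on(optimizer: Any, rank: int = 0) -> None:
    """Call `consolidate_state_dict` with whichever keyword the method accepts.

    Hypothetical helper; the candidate keyword names are assumptions.
    """
    params = inspect.signature(optimizer.consolidate_state_dict).parameters
    for keyword in ("to", "recipient_rank"):
        if keyword in params:
            optimizer.consolidate_state_dict(**{keyword: rank})
            return
    # Unknown keyword name: fall back to passing the rank positionally.
    optimizer.consolidate_state_dict(rank)
```

Passing the rank positionally in all cases would also sidestep the naming question, as long as the target rank stays the first argument.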
Describe alternatives you've considered
The only alternative is for users to write such Handlers themselves, which splits the checkpoint-saving logic across handlers. It can be written as:
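A minimal sketch of such a user-written handler, under stated assumptions: `ConsolidateZeRO` is a hypothetical class name, the optimizer is assumed to accept `consolidate_state_dict(to=...)`, and `trainer`, `zero_optimizer` and `checkpoint` in the usage comments are assumed to exist already.

```python
from typing import Any


class ConsolidateZeRO:
    """Hypothetical handler that consolidates ZeRO optimizer state on one rank."""

    def __init__(self, optimizer: Any, recipient_rank: int = 0) -> None:
        self.optimizer = optimizer
        self.recipient_rank = recipient_rank

    def __call__(self, engine: Any) -> None:
        # Must run on all ranks, and before the Checkpoint handler.
        self.optimizer.consolidate_state_dict(to=self.recipient_rank)


# Usage with Ignite (register the consolidation handler first):
# trainer.add_event_handler(Events.EPOCH_COMPLETED, ConsolidateZeRO(zero_optimizer))
# trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpoint)
```

Registering the consolidation handler before the checkpoint handler relies on handlers for the same event running in the order they were added.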