Fix DDP checkpoint #1415
Conversation
Reviewed only the deprecation part.
Let's split this PR, please.
Please use torch.distributed.all_gather_object or torch.distributed.broadcast_object_list for gathering pickleable objects (then if you just want to broadcast it you can access the rank0 item...
LGTM
Description
Currently, DDP creates multiple folders. This PR uses `broadcast_from_master`, which basically broadcasts a value (here a `str`) from the master to all other nodes, and moves it from `training.utils.distributed_training_utils` to `common.environment.ddp_utils`. The move is handled with deprecation, but the deprecation function was introduced in Add deprecate module #1416, which is why that PR should be merged first.
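As a rough illustration (not the PR's actual implementation), a `broadcast_from_master` helper along the lines suggested in the review could be built on `torch.distributed.broadcast_object_list`, which sends a picklable object from rank 0 to all other ranks:

```python
import torch.distributed as dist


def broadcast_from_master(value):
    """Broadcast a picklable value (e.g. a checkpoint dir as a str)
    from the master process (rank 0) to all other ranks.

    Sketch only: the real function in common.environment.ddp_utils
    may differ in signature and behavior.
    """
    # Only rank 0's entry matters; other ranks pass a placeholder
    # that broadcast_object_list overwrites in place.
    obj_list = [value if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(obj_list, src=0)
    return obj_list[0]
```

With this, every rank ends up with the same checkpoint directory string, so DDP writes into a single folder instead of one per process.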