Fix DDP checkpoint #1415

Merged: 9 commits, Aug 30, 2023

Contributor

@Louis-Dupont Louis-Dupont commented Aug 24, 2023

Description

Currently, DDP creates multiple folders.

  • To fix this, we introduce broadcast_from_master, which broadcasts a value (here a str) from the master to all other nodes.
  • This change required moving some DDP utility functions from training.utils.distributed_training_utils to common.environment.ddp_utils. The move is handled with a deprecation, but the deprecation function was introduced in "Add deprecate module" (#1416), so that PR should be merged first.
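
The deprecation-on-move pattern from the second bullet can be sketched roughly as follows. This is only an illustration using the standard `warnings` module, not the deprecation function from #1416; `deprecated_alias`, `get_rank_stub`, and `get_rank_legacy` are hypothetical names made up for this example.

```python
import functools
import warnings


def deprecated_alias(func, old_path, new_path):
    """Wrap `func` so that calls made through the old location emit a DeprecationWarning.

    Hypothetical helper for illustration; not the API added in #1416.
    """

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"`{old_path}` is deprecated; import it from `{new_path}` instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this wrapper
        )
        return func(*args, **kwargs)

    return wrapper


def get_rank_stub():
    """Stand-in for a DDP helper that now lives in the new module (hypothetical)."""
    return 0


# What the old module could re-export so that existing imports keep working.
get_rank_legacy = deprecated_alias(
    get_rank_stub,
    old_path="training.utils.distributed_training_utils.get_rank_stub",
    new_path="common.environment.ddp_utils.get_rank_stub",
)
```

Callers of the old path still get the same result, plus a warning that tells them where the function now lives.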

@Louis-Dupont Louis-Dupont marked this pull request as ready for review August 27, 2023 07:22
Collaborator

@ofrimasad ofrimasad left a comment

Reviewed only the deprecation part.
Let's split this PR, please.

src/super_gradients/common/deprecate.py (outdated, resolved)
src/super_gradients/common/deprecate.py (outdated, resolved)
src/super_gradients/common/deprecate.py (outdated, resolved)
Contributor

@shaydeci shaydeci left a comment

Please use torch.distributed.all_gather_object or torch.distributed.broadcast_object_list for gathering picklable objects (then, if you just want to broadcast it, you can access the rank 0 item...

src/super_gradients/common/environment/ddp_utils.py (outdated, resolved)
Contributor

@BloodAxe BloodAxe left a comment

LGTM

@ofrimasad ofrimasad merged commit 3e1019f into master Aug 30, 2023
@ofrimasad ofrimasad deleted the hotfix/SG-000-broadcast_to_master_run_id branch August 30, 2023 15:39
4 participants