Getting rid of "module." heritage #1184
Conversation
Looks great, I am just not sure about the name get_real_model, I think it is pretty confusing.
Maybe unwrap_model? It's also not that good, but maybe slightly more explicit? Any better idea?
…n someone is about to use it to indicate it is deprecated.
LGTM
LGTM
not sure if the Makefile was changed by mistake
# Conflicts: # src/super_gradients/training/sg_trainer/sg_trainer.py
LGTM
# Conflicts: # src/super_gradients/training/sg_trainer/sg_trainer.py
LGTM
What this PR does
We no longer store the model's state_dict with the "module." prefix.
Why this is important
When training in DDP/DP mode, the DistributedDataParallel/DataParallel wrappers add a "module." prefix to the keys of state_dict. This creates a discrepancy between models trained with and without DDP. Previously, SG worked around this with a hack that added an artificial "module." prefix to all models saved in single-GPU mode. However, this is a broken design. Some of our checkpoints are saved with the "module." prefix and some without it, which I believe is the main cause of the ugly NO_KEY_MATCHING checkpoint-loading hack. The actual reasoning is lost in time, but I'm pretty sure it comes from this discrepancy of saving models with and without the DDP wrapper. The key mismatch is illustrated in the sketch below.
Solution
Added a get_real_model method that returns the underlying ("real") model when it is wrapped with DP/DDP. Whenever a state_dict needs to be saved, this method should be used to unwrap the model first and save the state_dict of the real one; a sketch of the intended behavior is shown below.
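The following is only an illustrative sketch of such a helper in plain PyTorch; the actual implementation and signature in this PR may differ (the review above also suggests unwrap_model as an alternative name):

```python
from torch import nn
from torch.nn.parallel import DataParallel, DistributedDataParallel


def get_real_model(model: nn.Module) -> nn.Module:
    """Return the underlying model if it is wrapped with DP/DDP, otherwise the model itself."""
    if isinstance(model, (DataParallel, DistributedDataParallel)):
        return model.module
    return model


# When saving a checkpoint, unwrap first so the keys never carry the "module." prefix:
# torch.save(get_real_model(model).state_dict(), "checkpoint.pth")
```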
Testing
Not properly tested yet
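A possible unit test for this behavior could look like the hypothetical sketch below (not part of the PR; it only checks that unwrapping a DP-wrapped model yields state_dict keys without the "module." prefix):

```python
import unittest
import torch.nn as nn


class TestUnwrappedStateDict(unittest.TestCase):
    def test_no_module_prefix_after_unwrapping(self):
        wrapped = nn.DataParallel(nn.Linear(4, 2))
        real = wrapped.module  # what an unwrapping helper would return
        self.assertFalse(any(k.startswith("module.") for k in real.state_dict()))


if __name__ == "__main__":
    unittest.main()
```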
Components requiring special attention
Risks of introducing breaking changes
No risk assessment has been done yet
Related issues
#1163
#1153