Getting rid of "module." heritage #1184
Conversation
Looks great, I am just not sure about the name get_real_model, I think it is pretty confusing.
Maybe unwrap_model? It's also not that good, but maybe slightly more explicit? Any better idea?
…n someone is about to use it to indicate it is deprecated.
LGTM
LGTM
not sure if the Makefile was changed by mistake
# Conflicts: # src/super_gradients/training/sg_trainer/sg_trainer.py
LGTM
# Conflicts: # src/super_gradients/training/sg_trainer/sg_trainer.py
LGTM
What this PR does
We no longer store the model's state_dict with the "module." prefix.
Why this is important
When training in DDP/DP mode, the DistributedDataParallel/DataParallel wrappers add a "module." prefix to the keys of state_dict. This creates a discrepancy between models trained with and without DDP. Previously, SG worked around this with a hack that added an artificial "module." prefix to all models saved in single-GPU mode. However, this is a broken design. Some of our checkpoints are saved with the "module." prefix and some without it, which I believe is the main cause of the ugly NO_KEY_MATCHING checkpoint-loading hack. The actual reasoning is lost in time, but I'm pretty sure it comes from this discrepancy of saving models with and without the DDP wrapper. The key mismatch is illustrated in the sketch below.
Solution
Added a get_real_model method that returns the underlying ("real") model when it is wrapped with DP/DDP. Whenever a state_dict needs to be saved, this method should be used to unwrap the model first and save the state_dict of the real one; a sketch of the intended behavior is shown below.
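The following is only an illustrative sketch of such a helper in plain PyTorch; the actual implementation and signature in this PR may differ (the review above also suggests unwrap_model as an alternative name):

```python
from torch import nn
from torch.nn.parallel import DataParallel, DistributedDataParallel


def get_real_model(model: nn.Module) -> nn.Module:
    """Return the underlying model if it is wrapped with DP/DDP, otherwise the model itself."""
    if isinstance(model, (DataParallel, DistributedDataParallel)):
        return model.module
    return model


# When saving a checkpoint, unwrap first so the keys never carry the "module." prefix:
# torch.save(get_real_model(model).state_dict(), "checkpoint.pth")
```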
Testing
Not properly tested yet
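A possible unit test for this behavior could look like the hypothetical sketch below (not part of the PR; it only checks that unwrapping a DP-wrapped model yields state_dict keys without the "module." prefix):

```python
import unittest
import torch.nn as nn


class TestUnwrappedStateDict(unittest.TestCase):
    def test_no_module_prefix_after_unwrapping(self):
        wrapped = nn.DataParallel(nn.Linear(4, 2))
        real = wrapped.module  # what an unwrapping helper would return
        self.assertFalse(any(k.startswith("module.") for k in real.state_dict()))


if __name__ == "__main__":
    unittest.main()
```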
Components requiring special attention
Risks of introducing breaking changes
No risk assessment has been done yet
Related issues
#1163
#1153