In `torch/nn/modules/batchnorm.py`, the `SyncBatchNorm.convert_sync_batchnorm()` method copies over the `training` attribute: `module_output.training = module.training`.
The mmengine version is missing this line, even though it is present in the `revert_sync_batchnorm()` method directly above it.
Without this, an NCCL timeout occurs when a BN layer is kept in eval mode for fine-tuning while the model as a whole is in training mode. I figured this out thanks to a similar issue/solution here
Prerequisite
Environment
(Issue obvious from source)
Reproduces the problem - code sample
(Issue obvious from source)
Reproduces the problem - command or script
(Issue obvious from source)
Reproduces the problem - error message
(Issue obvious from source)
Additional information

(See description above.)