
[Bug] convert_sync_batchnorm missing 'training' attribute #1624

Open
collinmccarthy opened this issue Jan 15, 2025 · 0 comments
Labels
bug Something isn't working

Comments

Prerequisite

Environment

(Issue obvious from source)

Reproduces the problem - code sample

(Issue obvious from source)

Reproduces the problem - command or script

(Issue obvious from source)

Reproduces the problem - error message

(Issue obvious from source)

Additional information

In torch/nn/modules/batchnorm.py, the SyncBatchNorm.convert_sync_batchnorm() method copies over the training attribute with module_output.training = module.training.

The mmengine version of convert_sync_batchnorm() is missing this line, even though the equivalent copy is present in the revert_sync_batchnorm() method directly above it.

Without this, an NCCL timeout occurs when a BN layer is kept in eval mode for fine-tuning while the rest of the model is in training mode: the frozen layer silently flips back to training mode after conversion, and distributed sync stalls. I figured this out due to a similar issue/solution here
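For reference, a minimal sketch of what the fixed conversion would look like, modeled on PyTorch's own SyncBatchNorm.convert_sync_batchnorm (the function name and structure here are illustrative, not mmengine's exact code; the key addition is the marked line):

```python
import torch
import torch.nn as nn


def convert_sync_batchnorm(module: nn.Module) -> nn.Module:
    """Recursively replace BatchNorm layers with SyncBatchNorm.

    Illustrative sketch following torch/nn/modules/batchnorm.py;
    the fix is copying ``module.training`` onto the new layer.
    """
    module_output = module
    if isinstance(module, nn.modules.batchnorm._BatchNorm):
        module_output = nn.SyncBatchNorm(
            module.num_features,
            module.eps,
            module.momentum,
            module.affine,
            module.track_running_stats,
        )
        if module.affine:
            with torch.no_grad():
                module_output.weight = module.weight
                module_output.bias = module.bias
        module_output.running_mean = module.running_mean
        module_output.running_var = module.running_var
        module_output.num_batches_tracked = module.num_batches_tracked
        # The missing line: preserve train/eval mode so a BN layer
        # frozen in eval() stays frozen after conversion.
        module_output.training = module.training
    for name, child in module.named_children():
        module_output.add_module(name, convert_sync_batchnorm(child))
    del module
    return module_output
```

With this in place, converting a model whose BN layers were put in eval() for fine-tuning yields SyncBatchNorm layers that are also in eval(), instead of defaulting to training mode.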

@collinmccarthy collinmccarthy added the bug Something isn't working label Jan 15, 2025