
[Bug] convert_sync_batchnorm missing 'training' attribute #1624

Open
collinmccarthy opened this issue Jan 15, 2025 · 0 comments
Labels
bug Something isn't working

Comments

Prerequisite

Environment

(Issue obvious from source)

Reproduces the problem - code sample

(Issue obvious from source)

Reproduces the problem - command or script

(Issue obvious from source)

Reproduces the problem - error message

(Issue obvious from source)

Additional information

In torch/nn/modules/batchnorm.py, the SyncBatchNorm.convert_sync_batchnorm() method copies over the training attribute with module_output.training = module.training.

The mmengine version of convert_sync_batchnorm() is missing this line, even though the equivalent copy is present in the revert_sync_batchnorm() method directly above it.

Without this, an NCCL timeout occurs when a BN layer is kept in eval mode for fine-tuning while the rest of the model is in training mode: the frozen layer silently flips back to training mode after conversion, and distributed sync stalls. I figured this out due to a similar issue/solution here
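For reference, a minimal sketch of what the fixed conversion would look like, modeled on PyTorch's own SyncBatchNorm.convert_sync_batchnorm (the function name and structure here are illustrative, not mmengine's exact code; the key addition is the marked line):

```python
import torch
import torch.nn as nn


def convert_sync_batchnorm(module: nn.Module) -> nn.Module:
    """Recursively replace BatchNorm layers with SyncBatchNorm.

    Illustrative sketch following torch/nn/modules/batchnorm.py;
    the fix is copying ``module.training`` onto the new layer.
    """
    module_output = module
    if isinstance(module, nn.modules.batchnorm._BatchNorm):
        module_output = nn.SyncBatchNorm(
            module.num_features,
            module.eps,
            module.momentum,
            module.affine,
            module.track_running_stats,
        )
        if module.affine:
            with torch.no_grad():
                module_output.weight = module.weight
                module_output.bias = module.bias
        module_output.running_mean = module.running_mean
        module_output.running_var = module.running_var
        module_output.num_batches_tracked = module.num_batches_tracked
        # The missing line: preserve train/eval mode so a BN layer
        # frozen in eval() stays frozen after conversion.
        module_output.training = module.training
    for name, child in module.named_children():
        module_output.add_module(name, convert_sync_batchnorm(child))
    del module
    return module_output
```

With this in place, converting a model whose BN layers were put in eval() for fine-tuning yields SyncBatchNorm layers that are also in eval(), instead of defaulting to training mode.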

@collinmccarthy collinmccarthy added the bug Something isn't working label Jan 15, 2025