Sync module states during non-fit #17370
Conversation
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com> (cherry picked from commit 97a6186)
What does this PR do?
Using DDP, we only wrap the model with `DistributedDataParallel` in `trainer.fit`. Since the wrapper normally takes care of synchronizing the parameters and buffers across ranks, this synchronization was missing during evaluation (`validate`, `test`, `predict`).
This is not a big bug because most use cases will load a checkpoint during evaluation.
FSDP doesn't have this issue because we always use its wrapper.
Fabric doesn't have this issue because there's no logic around fitting vs. non-fitting.
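
For context, a minimal sketch of what syncing module states across ranks can look like, assuming a `torch.distributed` process group is already initialized. The helper name `sync_module_states` and the broadcast-from-rank-0 approach are illustrative only, not the PR's exact implementation:

```python
import itertools

import torch
import torch.distributed as dist


def sync_module_states(module: torch.nn.Module, src: int = 0) -> None:
    """Broadcast parameters and buffers from rank ``src`` to all other ranks.

    Illustrative helper only; assumes the default process group is initialized.
    """
    if not dist.is_available() or not dist.is_initialized():
        return
    for tensor in itertools.chain(module.parameters(), module.buffers()):
        # broadcast overwrites the tensor in place on every non-source rank
        dist.broadcast(tensor.data, src=src)
```

This mirrors the effect that wrapping with `DistributedDataParallel` has at construction time: after the broadcast, all ranks hold the same weights even though the model was never wrapped for the non-fit entry points.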
cc @Borda @justusschock @awaelchli