
Disable strict loading in multiprocessing launcher #16365

Merged
awaelchli merged 9 commits into master from feature/fit-spawn-missing-keys on Jan 18, 2023

Conversation

awaelchli
Contributor

@awaelchli awaelchli commented Jan 15, 2023

What does this PR do?

Fixes #14534

Sets strict=False when loading the state dict of the model back into the main process. The model in the main process may have a different architecture than the one trained in the worker processes:

import torch
from pytorch_lightning import LightningModule, Trainer


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(32, 2)

    def setup(self, stage=None):
        self.layer2 = torch.nn.Linear(32, 2)  # this layer does not exist in the main process


model = BoringModel()  # layer2 does not exist yet
trainer = Trainer(strategy="ddp_spawn", ...)
trainer.fit(model)
# Here, at the end of fit, the trainer loads the model weights back into the main process,
# but it can only do so for layer1, because layer2 does not exist there.

This is an inherent limitation of training with the "spawn" launch method. Since we don't know what the user will do with the model after fit(), the best we can do is load the weights that match.
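For illustration, here is a minimal plain-PyTorch sketch of what non-strict loading does. The two models below are hypothetical stand-ins for the worker-process and main-process copies, not the launcher's actual internals:

from torch import nn

# Stand-in for the model in the worker process, where setup() has run.
trained = nn.ModuleDict({"layer1": nn.Linear(32, 2), "layer2": nn.Linear(32, 2)})
# Stand-in for the main-process model, where setup() never ran and layer2 is missing.
main = nn.ModuleDict({"layer1": nn.Linear(32, 2)})

# strict=False loads the keys that match (layer1) and reports the rest,
# instead of raising a RuntimeError over the extra layer2 weights.
result = main.load_state_dict(trained.state_dict(), strict=False)
print(result.unexpected_keys)  # ['layer2.weight', 'layer2.bias']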

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

cc @Borda @justusschock @awaelchli

@awaelchli awaelchli added the feature Is an improvement or enhancement label Jan 15, 2023
@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Jan 15, 2023
@awaelchli awaelchli added strategy: ddp spawn fun Staff contributions outside working hours - to differentiate from the "community" label and removed pl Generic label for PyTorch Lightning package labels Jan 15, 2023
@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Jan 15, 2023
@awaelchli awaelchli marked this pull request as ready for review January 15, 2023 13:04
@github-actions
Contributor

github-actions bot commented Jan 15, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 pytorch_lightning: Tests workflow
Check ID Status
pl-cpu (macOS-11, pytorch, 3.8, 1.11) success
pl-cpu (macOS-11, pytorch, 3.9, 1.12) success
pl-cpu (macOS-11, pytorch, 3.10, 1.13) success
pl-cpu (macOS-11, pytorch, 3.8, 1.10, oldest) success
pl-cpu (ubuntu-20.04, pytorch, 3.8, 1.10) success
pl-cpu (ubuntu-20.04, pytorch, 3.9, 1.11) success
pl-cpu (ubuntu-20.04, pytorch, 3.10, 1.12) success
pl-cpu (ubuntu-20.04, pytorch, 3.10, 1.13) success
pl-cpu (ubuntu-20.04, pytorch, 3.7, 1.10, oldest) success
pl-cpu (windows-2022, pytorch, 3.9, 1.11) success
pl-cpu (windows-2022, pytorch, 3.10, 1.12) success
pl-cpu (windows-2022, pytorch, 3.10, 1.13) success
pl-cpu (windows-2022, pytorch, 3.7, 1.10, oldest) success
pl-cpu (slow, macOS-11, pytorch, 3.7, 1.11) success
pl-cpu (slow, ubuntu-20.04, pytorch, 3.7, 1.11) success
pl-cpu (slow, windows-2022, pytorch, 3.7, 1.11) success
pl-cpu (macOS-11, lightning, 3.8, 1.13) success
pl-cpu (ubuntu-20.04, lightning, 3.8, 1.13) success
pl-cpu (windows-2022, lightning, 3.8, 1.13) success

These checks are required after the changes to src/pytorch_lightning/strategies/launchers/multiprocessing.py, tests/tests_pytorch/strategies/launchers/test_multiprocessing.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py.

🟢 pytorch_lightning: Azure GPU
Check ID Status
pytorch-lightning (GPUs) success

These checks are required after the changes to src/pytorch_lightning/strategies/launchers/multiprocessing.py, tests/tests_pytorch/strategies/launchers/test_multiprocessing.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py.

🟢 pytorch_lightning: Azure HPU
Check ID Status
pytorch-lightning (HPUs) success

These checks are required after the changes to src/pytorch_lightning/strategies/launchers/multiprocessing.py, tests/tests_pytorch/strategies/launchers/test_multiprocessing.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py.

🟢 pytorch_lightning: Azure IPU
Check ID Status
pytorch-lightning (IPUs) success

These checks are required after the changes to src/pytorch_lightning/strategies/launchers/multiprocessing.py, tests/tests_pytorch/strategies/launchers/test_multiprocessing.py, tests/tests_pytorch/strategies/test_ddp_spawn_strategy.py.

🟢 pytorch_lightning: Docs
Check ID Status
make-doctest (pytorch) success
make-html (pytorch) success

These checks are required after the changes to src/pytorch_lightning/strategies/launchers/multiprocessing.py.

🟢 mypy
Check ID Status
mypy success

These checks are required after the changes to src/pytorch_lightning/strategies/launchers/multiprocessing.py.

🟢 install
Check ID Status
install-pkg (ubuntu-22.04, app, 3.7) success
install-pkg (ubuntu-22.04, app, 3.10) success
install-pkg (ubuntu-22.04, fabric, 3.7) success
install-pkg (ubuntu-22.04, fabric, 3.10) success
install-pkg (ubuntu-22.04, pytorch, 3.7) success
install-pkg (ubuntu-22.04, pytorch, 3.10) success
install-pkg (ubuntu-22.04, lightning, 3.7) success
install-pkg (ubuntu-22.04, lightning, 3.10) success
install-pkg (ubuntu-22.04, notset, 3.7) success
install-pkg (ubuntu-22.04, notset, 3.10) success
install-pkg (macOS-12, app, 3.7) success
install-pkg (macOS-12, app, 3.10) success
install-pkg (macOS-12, fabric, 3.7) success
install-pkg (macOS-12, fabric, 3.10) success
install-pkg (macOS-12, pytorch, 3.7) success
install-pkg (macOS-12, pytorch, 3.10) success
install-pkg (macOS-12, lightning, 3.7) success
install-pkg (macOS-12, lightning, 3.10) success
install-pkg (macOS-12, notset, 3.7) success
install-pkg (macOS-12, notset, 3.10) success
install-pkg (windows-2022, app, 3.7) success
install-pkg (windows-2022, app, 3.10) success
install-pkg (windows-2022, fabric, 3.7) success
install-pkg (windows-2022, fabric, 3.10) success
install-pkg (windows-2022, pytorch, 3.7) success
install-pkg (windows-2022, pytorch, 3.10) success
install-pkg (windows-2022, lightning, 3.7) success
install-pkg (windows-2022, lightning, 3.10) success
install-pkg (windows-2022, notset, 3.7) success
install-pkg (windows-2022, notset, 3.10) success

These checks are required after the changes to src/pytorch_lightning/strategies/launchers/multiprocessing.py.


Thank you for your contribution! 💜

Note
This comment is automatically generated and updates every 180 seconds for 60 minutes. If you have any other questions, contact carmocca for help.

@mergify mergify bot removed the has conflicts label Jan 17, 2023
@awaelchli awaelchli added bug Something isn't working and removed feature Is an improvement or enhancement labels Jan 18, 2023
@awaelchli awaelchli added this to the v1.9.x milestone Jan 18, 2023
@mergify mergify bot added the ready PRs ready to be merged label Jan 18, 2023
@awaelchli awaelchli enabled auto-merge (squash) January 18, 2023 22:23
@awaelchli awaelchli merged commit 7d36db8 into master Jan 18, 2023
@awaelchli awaelchli deleted the feature/fit-spawn-missing-keys branch January 18, 2023 22:53
@vitkl

vitkl commented Jan 23, 2023

@awaelchli Thank you for quickly implementing this!

Is it possible to install pytorch-lightning with this fix (not lightning)? I don't see any way to do this listed here: https://pytorch-lightning.readthedocs.io/en/stable/starter/installation.html


@carmocca
Contributor

Try this:

PACKAGE_NAME=pytorch pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/master.zip -U

@vitkl

vitkl commented Jan 24, 2023

This works, thank you!

@vitkl

vitkl commented Jan 24, 2023

Although the fix itself doesn't work for my Pyro model: the module ends up with no parameters.

module  # nn.Module that contains the pyro model and guide as attributes module.model and module.guide
training_plan = TrainingPlan(pyro_module=module, **plan_kwargs)  # a pl.LightningModule
trainer = Trainer(
    max_epochs=max_epochs,
    accelerator=accelerator,
    devices=devices,
    strategy=strategy,
    **trainer_kwargs,
)
trainer.fit(training_plan, data_splitter)
module.state_dict().keys()
# no parameters listed

Is there any point downstream of this change where the contents of state_dict can be ignored or overwritten?

@awaelchli
Contributor Author

This is probably because you don't create your layers at instantiation time, only later. You will always run into this limitation with the "ddp_spawn" strategy; it is a consequence of its design. In this case, you should choose strategy="ddp" and you will never have this issue.
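For instance, a hedged sketch of that fix, following the BoringModel example from the PR description: create all layers in __init__ so that every process, including the main one, has the same state_dict keys.

import torch
from pytorch_lightning import LightningModule

class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        # Creating every layer here (instead of in setup()) means the
        # main-process model exposes the same state_dict keys as the
        # trained worker copies, so no weights are skipped when loading.
        self.layer1 = torch.nn.Linear(32, 2)
        self.layer2 = torch.nn.Linear(32, 2)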

@vitkl

vitkl commented Jan 25, 2023

I see, thanks for explaining!

And is there no way to modify the strategy, e.g. to run the callback's setup before the parameters are loaded back in the main process?

@awaelchli
Contributor Author

If you want, you can always call the setup() methods yourself in the main process so that the layers exist before the weights get loaded back:

my_callback.setup(trainer, my_model, stage="fit")  # create the layers in the main process
my_model.setup(stage="fit")  # so the weights loaded back after fit() find matching keys
trainer.fit(my_model, ...)

But before falling back to this workaround, I suggest just using the regular ddp strategy.

@vitkl

vitkl commented Jan 29, 2023

Thanks @awaelchli! Both solutions, ddp and ddp_notebook + running the callback in the main process, seem to work (the code runs). However, I see that both lead to identical batches being loaded on both devices (the data-loaded tensors, such as the Pyro plate indices, are identical on both devices). I read #7186 and other related issues, but I don't understand why this should be happening. Is it possible to check whether the DistributedSampler (replace_sampler_ddp=True) was created successfully?

It appears that worker_init_fn=pl_worker_init_function leads to all workers being initialised in all processes using the same seed. Is this expected?


@awaelchli
Contributor Author

awaelchli commented Jan 29, 2023

Is it possible to check if DistributedSampler (replace_sampler_ddp=True) was created successfully?

You can check isinstance(trainer.train_dataloaders[0].sampler, DistributedSampler) for example.

It appears that worker_init_fn=pl_worker_init_function leads to all workers being initialised in all processes using the same seed. Is this expected?

Call seed_everything(1, workers=True) to seed the dataloader workers based on the global rank. This doesn't affect the training processes themselves; for those, you can set a per-rank seed, e.g. seed_everything(seed + trainer.global_rank).
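Putting both suggestions together, a sketch (the Trainer configuration here is hypothetical, and the fit() call is elided):

from pytorch_lightning import Trainer, seed_everything
from torch.utils.data.distributed import DistributedSampler

# workers=True derives each dataloader worker's seed from the process's
# global rank, so workers in different processes don't share a seed.
seed_everything(1, workers=True)

trainer = Trainer(strategy="ddp", devices=2)  # hypothetical configuration
# ... trainer.fit(model, datamodule) ...

# After fit(), verify that Lightning replaced the sampler:
assert isinstance(trainer.train_dataloaders[0].sampler, DistributedSampler)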

For further questions, please consider posting in the forum; if you find a bug, a new issue would be appreciated (since this thread is about strict loading of weights).

awaelchli added a commit that referenced this pull request Feb 11, 2023
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
lexierule pushed a commit that referenced this pull request Feb 15, 2023
* Add .git-blame-ignore-revs (#16709)

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>

* Fix strategy type validation in connectors (#16693)

* Disable strict loading in multiprocessing launcher (#16365)


Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>

* Fix min-epochs and early-stopping triggering too many validation runs (#16719)

Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>

* Update hydra-core requirement from <1.3.0,>=1.0.5 to >=1.0.5,<1.4.0 in /requirements (#16736)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [App] Add support for private data (#16738)

Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>

* [App] Add rm one level below project level (#16740)

Co-authored-by: Ethan Harris <ethanwharris@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>

* ci: cleaning caches (#16752)

* CI: Update colossalai version (#16747)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

* Update version and changelog for 1.9.2

---------

Co-authored-by: Akihiro Nitta <nitta@akihironitta.com>
Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka <jirka.borovec@seznam.cz>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: thomas chaton <thomas@grid.ai>
Co-authored-by: thomas <thomas@thomass-MacBook-Pro.local>
Co-authored-by: Ethan Harris <ethanwharris@gmail.com>
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
@awaelchli awaelchli added strategy: ddp DistributedDataParallel and removed strategy: ddp spawn labels Nov 4, 2023
Labels

  • bug (Something isn't working)
  • fun (Staff contributions outside working hours - to differentiate from the "community" label)
  • pl (Generic label for PyTorch Lightning package)
  • ready (PRs ready to be merged)
  • strategy: ddp (DistributedDataParallel)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Setting up submodules in setup doesn't work correctly with DDP
5 participants