Unable to resume from checkpoint when using apex #11488
Comments
@dtmoodie thanks for the report. Would you be interested in investigating this issue?
@awaelchli I believe the problem is that amp.initialize isn't called on the module before the checkpoint is loaded.
That's a good call. As part of #10416 we should anyway look into the call where amp.initialize happens: we should check whether we can move amp.initialize from there to an earlier point, e.g. setup(), and then get rid of dispatch(), which is currently only implemented by apex amp. cc @four4fish
I tried to pick this up again with no success. After #11952, the optimizers in DDP get set up after the model has been wrapped (to understand why, read the description in #11886). With respect to the plugin setup, the order is the following:
Observation: in order to support reloading the amp state, we need to make sure we call amp.initialize before the checkpoint is loaded.
In order to resolve this issue, we need to satisfy these requirements (dictated by apex):
Requirement 1: amp.initialize can't be called too early; it has to be called AFTER the model has been moved to the device.
These three requirements contradict the current order in which things are set up (1-4 above). The amp.initialize call can't be inserted anywhere between 1-4 without breaking one of these requirements. I don't see how apex can be supported in our DDP strategies at the moment without changing the place where optimizers get set up.
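For reference, the ordering that apex's own documentation prescribes for plain PyTorch looks roughly like the sketch below. It is illustrative only (not Lightning code) and shows why the device/optimizer/DDP constraints above are hard to satisfy once the optimizers are created after wrapping:

```python
# Illustrative sketch of the order apex documents for plain PyTorch (not Lightning internals).
import torch
from apex import amp
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(32, 2).cuda()                      # 1. model moved to the device first
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)   # 2. optimizer built from the device params
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")  # 3. only then amp.initialize
# 4. DDP wrapping comes last (assumes torch.distributed.init_process_group was already called)
model = DDP(model, device_ids=[torch.cuda.current_device()])
```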
(Comment from offline discussion) Let's just be happy for now with allowing loading checkpoints trained with apex enabled but not reloading the apex state. We would print a warning in this case. This can still be useful for further training or inference.
Unfortunately, we can't simply ignore reloading the amp state and still continue training with apex.
The logic would either have to be more involved (in the reloading code itself, outside the plugin), or we would need a hard runtime error.
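For context, apex's own checkpointing recipe (from its documentation) is roughly the sketch below; the key constraint is that amp.load_state_dict only works after amp.initialize has run, which is where the AttributeError in this report comes from:

```python
# Sketch of apex's documented checkpointing recipe; this is what any reloading logic
# would have to replicate, and it only works once amp.initialize has been called.
import torch
from apex import amp

model = torch.nn.Linear(32, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

# Save model, optimizer, and the amp (loss scaler) state together.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "amp": amp.state_dict(),
}
torch.save(checkpoint, "checkpoint.pt")

# Restore: amp.initialize must already have been called on the fresh model/optimizer,
# otherwise amp.load_state_dict fails because the internal AmpState has no loss_scalers yet.
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
amp.load_state_dict(checkpoint["amp"])
```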
IMO that's a bug in APEX. It has already been reported in NVIDIA/apex#1057 with no answer.
Technically this issue isn't resolved, as users are still "unable to resume from checkpoint when using apex". We just improved the UX, but it's still blocked by Apex anyway.
Are there any updates on this issue?
🐛 Bug
When trying to resume a model that was trained with apex, I cannot load the checkpoint.
To Reproduce
Train model with trainer.fit with the following params:
Then attempt to continue training using the checkpoint and ckpt_path.
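The exact Trainer arguments from the original report were not preserved in this copy, so the snippet below is an assumed-minimal reproduction for PyTorch Lightning 1.5.x, with a placeholder module and checkpoint path:

```python
import torch
from pytorch_lightning import LightningModule, Trainer


class ToyModule(LightningModule):
    """Hypothetical stand-in for the reporter's model (the original was not shared)."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-2)

    def train_dataloader(self):
        return torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8)


# First run: train with the apex backend (PL 1.5.x Trainer arguments).
trainer = Trainer(gpus=1, precision=16, amp_backend="apex", amp_level="O2", max_epochs=1)
trainer.fit(ToyModule())

# Second run: resume from the saved checkpoint -- this is where the AttributeError appears.
trainer = Trainer(gpus=1, precision=16, amp_backend="apex", amp_level="O2", max_epochs=2)
trainer.fit(ToyModule(), ckpt_path="path/to/last.ckpt")  # placeholder checkpoint path
```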
The error that I get is:
AttributeError: 'AmpState' object has no attribute 'loss_scalers'
Expected behavior
Training resumes as it would without apex.
Environment
Please copy and paste the output from our environment collection script:
- GPU:
  - NVIDIA GeForce RTX 3090
- available: True
- version: 11.4
- numpy: 1.21.2
- pyTorch_debug: False
- pyTorch_version: 1.10.0a0+0aef44c
- pytorch-lightning: 1.5.6
- tqdm: 4.62.3
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.8.12
- version: #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021
- How you installed PyTorch (conda, pip, source): pip

cc @carmocca @justusschock @awaelchli @akihironitta @rohitgr7