Fix checkpoint pickle during ddp (on master only) #433

Closed
williamFalcon opened this issue Oct 25, 2019 · 4 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@williamFalcon
Contributor

williamFalcon commented Oct 25, 2019

@neggert looks like the new hparams saving fails with ddp/ddp2

  File "/private/home/falc/.conda/envs/ddt2/lib/python3.7/site-packages/pytorch_lightning/callbacks/pt_callbacks.py", line 245, in on_epoch_end
    self.save_model(filepath, overwrite=True)
  File "/private/home/falc/.conda/envs/ddt2/lib/python3.7/site-packages/pytorch_lightning/callbacks/pt_callbacks.py", line 224, in save_model
    self.save_function(filepath)
  File "/private/home/falc/.conda/envs/ddt2/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer_io.py", line 127, in save_checkpoint
    torch.save(checkpoint, filepath)
  File "/private/home/falc/.local/lib/python3.7/site-packages/torch/serialization.py", line 224, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/private/home/falc/.local/lib/python3.7/site-packages/torch/serialization.py", line 149, in _with_file_like
    return body(f)
  File "/private/home/falc/.local/lib/python3.7/site-packages/torch/serialization.py", line 224, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/private/home/falc/.local/lib/python3.7/site-packages/torch/serialization.py", line 296, in _save
    pickler.dump(obj)
AttributeError: Can't pickle local object 'ArgumentParser.__init__.<locals>.identity'
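
For context on the error above: pickle stores functions by importable name, so a function defined inside another function (like the `identity` helper the message points at) cannot be serialized, and neither can any object that holds a reference to one. The sketch below is only an illustration under assumptions, not the code path in pytorch_lightning: the `--learning_rate` flag, the `pickle_error` helper, and the `checkpoint` layout are made up for the example. It suggests that a plain `Namespace` of simple values pickles fine, which would mean the failure needs the parser (or something else holding a nested function) to be reachable from the object being saved.

```python
import argparse
import pickle


def pickle_error(obj):
    """Return None if obj pickles, else a short description of the error."""
    try:
        pickle.dumps(obj)
        return None
    except Exception as err:  # the exception type varies across Python versions
        return f"{type(err).__name__}: {err}"


def make_local_function():
    # Same pattern as the `identity` helper named in the traceback:
    # a function defined inside another function has no importable name.
    def identity(value):
        return value
    return identity


parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.02)  # hypothetical flag
hparams = parser.parse_args([])

# A Namespace that only holds simple values pickles without trouble.
namespace_error = pickle_error(hparams)

# A nested function cannot be pickled, so anything that references one fails too.
local_error = pickle_error(make_local_function())

# Whether the parser object itself pickles may depend on the Python version,
# so this result is reported rather than assumed.
parser_error = pickle_error(parser)

# Possible workaround (an assumption, not the project's fix):
# keep only plain values in what gets saved.
checkpoint = {"hparams": vars(hparams)}
restored = pickle.loads(pickle.dumps(checkpoint))

print("namespace:", namespace_error)
print("local function:", local_error)
print("parser:", parser_error)
```

If `parser_error` is not `None` on a given install, any hparams object that keeps a reference back to its parser would be expected to fail in the same way when passed to `torch.save`.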

@williamFalcon williamFalcon added the bug (Something isn't working) label Oct 25, 2019
@neggert
Contributor

neggert commented Oct 28, 2019

Will take a look.

@neggert
Contributor

neggert commented Oct 30, 2019

I'm having trouble reproducing this. I'm running

python gpu_template.py --gpus 2 --distributed_backend ddp

on a single node with 2 GPUs (I've locally reverted your patch that catches the error). That runs fine, and I'm able to see that hparams are saved to the checkpoint by loading it manually.

I even tried replacing the ArgumentParser with a HyperOptArgumentParser, but everything still works. Can you share any more details about what you were doing to get this error?
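
One way to narrow down a report like this, sketched here as a suggestion rather than anything tried in the thread, is to pickle each top-level entry of the checkpoint dict separately and see which one fails. The helper name `find_unpicklable` and the keys in `example_checkpoint` are hypothetical; the real checkpoint layout may differ.

```python
import pickle


def find_unpicklable(checkpoint):
    """Map each top-level key that fails to pickle to its error message."""
    failures = {}
    for key, value in checkpoint.items():
        try:
            pickle.dumps(value)
        except Exception as err:  # pickle raises several exception types
            failures[key] = f"{type(err).__name__}: {err}"
    return failures


# Hypothetical stand-in for the checkpoint dict; the real keys may differ.
example_checkpoint = {
    "epoch": 3,
    "state_dict": {"layer.weight": [0.1, 0.2]},
    "hparams": {"transform": lambda x: x},  # lambdas cannot be pickled
}

failures = find_unpicklable(example_checkpoint)
for key, message in failures.items():
    print(f"{key}: {message}")
```

Running the same helper on the dict passed to `torch.save` just before the failing call should point at the entry that drags in the unpicklable object.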

@stale

stale bot commented Feb 22, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added and removed the won't fix (This will not be worked on) label Feb 22, 2020
@Borda Borda added the help wanted (Open to be worked on) label Feb 22, 2020
@Borda
Member

Borda commented Feb 22, 2020

@neggert does it mean that the problem is gone? 🤖

3 participants