
TypeError on run_training_teardown method on the master branch #1999

Closed
SubodhDahal opened this issue May 29, 2020 · 13 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)

@SubodhDahal
Contributor

🐛 Bug

I switched to the master branch to test the bugfix for #1919, but the same code that runs on the stable 0.7.6 release no longer runs.

To Reproduce

Maybe just switching to the recent master branch would reproduce the issue.

GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
WARNING:lightning:No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0]
INFO:lightning:CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
  File "/home/mycode/src/vae.py", line 457, in <module>
    main()
  File "/home/mycode/src/vae.py", line 450, in main
    run_model(hparams)
  File "/home/mycode/src/vae.py", line 386, in run_model
    trainer.fit(vae)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 894, in fit
    self.single_gpu_train(model)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 502, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in run_pretrain_routine
    self.logger.save()
  File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 10, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 161, in save
    save_hparams_to_yaml(hparams_file, self.hparams)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 151, in save_hparams_to_yaml
    yaml.dump(hparams, fp)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/__init__.py", line 290, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/__init__.py", line 278, in dump_all
    dumper.represent(data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 27, in represent
    node = self.represent_data(data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 207, in represent_dict
    return self.represent_mapping('tag:yaml.org,2002:map', data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
    return self.represent_mapping(
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
    return self.represent_mapping(
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
    return self.represent_mapping(
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 48, in represent_data
    node = self.yaml_representers[data_types[0]](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 207, in represent_dict
    return self.represent_mapping('tag:yaml.org,2002:map', data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
    return self.represent_mapping(
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
    return self.represent_mapping(
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
    return self.represent_mapping(
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
    return self.represent_mapping(
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 342, in represent_object
    return self.represent_mapping(
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 118, in represent_mapping
    node_value = self.represent_data(item_value)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 52, in represent_data
    node = self.yaml_multi_representers[data_type](self, data)
  File "/home/mycode/src/venv/lib/python3.8/site-packages/yaml/representer.py", line 317, in represent_object
    reduce = data.__reduce_ex__(2)
TypeError: cannot pickle '_thread.lock' object
Error in atexit._run_exitfuncs:
TypeError: run_training_teardown() missing 1 required positional argument: 'self'
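
(Aside, for context: the first TypeError above is yaml failing to serialize an object that holds a thread lock. The hparams contents below are invented, but they reproduce the same yaml behaviour stand-alone; the second TypeError about run_training_teardown is the one this issue is about.)

```python
import threading
import yaml

# Invented hparams: the only important part is that some nested value holds
# a _thread.lock, which yaml's representer cannot serialize.
hparams = {"lr": 1e-3, "some_object": threading.Lock()}

yaml.dump(hparams)
# -> TypeError: cannot pickle '_thread.lock' object
```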

Expected behavior

I would have expected it to run exactly the same way as it does with version 0.7.6.

Environment

* CUDA:
	- GPU:
		- GeForce GTX 1050 Ti with Max-Q Design
	- available:         True
	- version:           10.2
* Packages:
	- numpy:             1.18.3
	- pyTorch_debug:     False
	- pyTorch_version:   1.5.0
	- pytorch-lightning: 0.7.7-dev
	- tensorboard:       2.2.1
	- tqdm:              4.46.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         x86_64
	- python:            3.8.2
	- how you installed PyTorch (conda, pip, source): pip

Additional context

The same error is present while running the collect_env_details.py script.

@SubodhDahal added the help wanted label on May 29, 2020
@Borda
Member

Borda commented May 29, 2020

@justusschock ^^

@Borda added the bug label on May 29, 2020
@awaelchli
Contributor

awaelchli commented May 30, 2020

Probably this line here
https://github.com/PyTorchLightning/pytorch-lightning/blob/ceecf1cea92dc2d8c29b1364237ac9467abf2f9b/pytorch_lightning/trainer/training_loop.py#L309
should be
self.run_training_teardown()

Not sure

@awaelchli
Contributor

awaelchli commented May 31, 2020

Investigated.
It is because the atexit.register decorator can only be applied to functions, not methods.
The self argument is not passed in.

This error should have shown up in the tests.
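
A minimal stand-alone sketch of that failure mode (not Lightning's actual code): applying atexit.register as a decorator at class level registers the plain, unbound function, so at interpreter exit it is called without self.

```python
import atexit

class Demo:
    # atexit.register returns the function unchanged, but it has already
    # registered the *unbound* function as the exit callback.
    @atexit.register
    def teardown(self):
        print("cleaning up")

# At interpreter exit, atexit calls teardown() with no arguments, raising:
# TypeError: teardown() missing 1 required positional argument: 'self'
```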

@justusschock
Member

@awaelchli so changing this to self.run_training_teardown would fix it?

@awaelchli
Contributor

No, I tested it; the problem is not there. The problem is that atexit.register is applied to a method (which takes self as an argument), but the decorator is meant for functions, which don't get self as input.
That seems to be what is causing the problem.

@justusschock
Member

Yes, I think that's why I explicitly passed the self argument :) I'll try to come up with another solution for this :)

@awaelchli
Contributor

Maybe we can wrap the cleanup code into a closure that binds self, and then the decorator can be applied to that closure function. Not sure though, I haven't looked at the details.
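
A rough sketch of that idea (the class and registration point here are hypothetical, not Lightning's actual code): instead of decorating the method at class level, register a callable that already has self bound, e.g. the bound method itself or a closure over self, once the instance exists.

```python
import atexit

class TrainerLike:
    """Hypothetical stand-in for the Trainer, for illustration only."""

    def __init__(self):
        # The bound method is a zero-argument callable (self is already
        # bound), so atexit can invoke it at interpreter exit.
        atexit.register(self.run_training_teardown)

    def run_training_teardown(self):
        print("running teardown")
```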

@justusschock
Member

Do you want to take this over? Otherwise I'd try to make some time for it later/tomorrow :)

@williamFalcon
Contributor

williamFalcon commented Jun 1, 2020

More broadly, why is that function needed? We already have teardown that works on Ctrl+C, no?
Is this to tear down on a SIGUSR1 as well?

And we already have that signal registered...

https://github.com/PyTorchLightning/pytorch-lightning/blob/8b9b923ca8ad9fdb0ae22928de0029e7c2e7a782/pytorch_lightning/trainer/training_io.py#L206

@justusschock
Member

This would trigger teardown for every termination signal except SIGKILL (SIGTERM and the like).

There are clusters (non-SLURM ones) that also need this kind of signal handling. And I think we should run cleanup whenever a job ends, if possible (either on normal program exit or on error); otherwise you may get issues with checkpointing etc.

Also, this might enable proper hparam logging with metrics (not sure about that, though).
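
For illustration, a minimal sketch of that kind of signal handling (the handler and what it does are assumptions for the example; Lightning's actual teardown hook lives in the Trainer):

```python
import signal
import sys

def _teardown_handler(signum, frame):
    # Hypothetical cleanup: save a checkpoint, flush loggers, etc.
    print(f"caught signal {signum}, running teardown")
    sys.exit(0)

# SIGTERM and SIGINT can be intercepted; SIGKILL can never be caught.
signal.signal(signal.SIGTERM, _teardown_handler)
signal.signal(signal.SIGINT, _teardown_handler)
```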

@awaelchli
Contributor

> Do you want to take this over? Otherwise I'd try to make some time for it later/tomorrow :)

I think I'd better keep my hands away from it. I can't test the SLURM signals anyway, since I don't have that setup.

@Borda
Member

Borda commented Jun 1, 2020

> > Do you want to take this over? Otherwise I'd try to make some time for it later/tomorrow :)
>
> I think I'd better keep my hands away from it. I can't test the SLURM signals anyway, since I don't have that setup.

SLURM should also run on CPU :]

@williamFalcon
Contributor

@awaelchli please see comments. I'm not sure we should have this handler thing. I think it'll also break DDP.
