When the parameter gpus of Trainer > 1, _pickle.PicklingError #2883
Hi! Thanks for your contribution! Great first issue!
ddp_spawn requires everything to be pickleable (see https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html#make-models-pickleable). I believe you are running into this issue because you store something in your training script or LightningModule that is not pickleable. You don't see this error with a single GPU because no subprocess needs to be spawned there. If you can't figure out what it is, try pickling the model yourself, as described in the linked docs.
Thank you so much! A new problem arises, and the frustrating thing is that I have no idea how to solve it. The new error is as follows:
and my code
Is the error due to a problem with the geometric library loading the dataset? Is there a good solution?
A quick search on their github reveals several issues with pickling. It looks like they still need to work on DDP compatibility. https://github.com/rusty1s/pytorch_geometric/issues?q=is%3Aissue+is%3Aopen+pickle
Probably just increase num_workers to a higher number :) Now back to your new problem. It looks like torch_geometric uses a different type of dataloader, so when we try to replace the DDP sampler it fails. Do you have minimal code you can share in a Google Colab notebook? It would save me some time to reproduce, but if not it's OK and I'll try myself.
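The num_workers suggestion above targets the low CPU utilization: a DataLoader with `num_workers=N` preprocesses samples in N parallel workers instead of one. A stdlib-only sketch of the same idea (with torch you would simply pass `DataLoader(dataset, batch_size=..., num_workers=8)`); `load_sample` is a hypothetical stand-in for expensive decode/transform work:

```python
from concurrent.futures import ThreadPoolExecutor

def load_sample(i):
    # Placeholder for per-sample loading work (decode, transform, augment)
    return i * i

# Like num_workers=4: four workers prepare samples concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    batch = list(pool.map(load_sample, range(8)))

print(batch)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

With a single worker (the DataLoader default, `num_workers=0`), all of this work runs on one core in the training process, which matches the "40 cores but only 100% CPU" symptom described above.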
Thank you for your patience. Since I have not used a Google Colab notebook before, I will learn it as soon as possible and hope to run it on Colab tomorrow. Please wait for me, thank you!
The code has been reproduced in the Google Colab notebook. However, only one GPU can be used on Colab.
When gpus=1 is tested, it runs normally. But even with gpus=1, watching the CPU with the top command on Linux, the CPU utilization never exceeds 100%, although I have 40 cores, so in theory it could reach 4000%. I suspect this is related to the implementation of the pytorch_geometric library. Meanwhile, when training on the Cora dataset with plain PyTorch code, the CPU can reach 3000%; but with Trainer(gpus=1, max_epochs=100) it never exceeds 100%. I really like pytorch-lightning and look forward to solving this problem. Can you give some guidance and suggestions?
Hi @guangmingjian

```python
trainer = Trainer(..., replace_sampler_ddp=False)

dataset = ...
sampler = torch.utils.data.DistributedSampler(dataset, num_replicas=num_gpus, rank=trainer.global_rank)
train_loader = torch_geometric.data.DataLoader(..., sampler=sampler)
```

The reason we have to do this manually here is that torch geometric has a custom dataloader, and it is impossible for us to know which args we need to pass in. They use a custom collate_fn and set it internally. As far as I know, we cannot automate this. This will solve the error message, but you will then run into a different problem: a new error saying that your batch is not on the correct device. I found out that torch geometric uses a custom DataParallel module to move their Batch data to the device, but they don't implement a module for DistributedDataParallel. It's a very similar problem as for torchtext. These datasets/dataloaders return objects that are not simple dicts, so we don't have a generic way of moving the data to the device, or scattering and gathering it. Maybe the best is to open an issue on their github, although I already see some related ones about distributed.
Thanks for your answers!!!
You can try to use the latest version (not the stable one), e.g.:
@guangmingjian based on @awaelchli's answer, closing this issue for now. Re-open if you think there is a need for further discussion.
Hi, I encountered a similar problem. When I set devices=2, accelerator="gpu", I got:
❓ Questions and Help
What is your question?
I have 4 GPUs; only Trainer(gpus=1, row_log_interval=10, max_epochs=100) runs normally, but the CPU utilization is extremely low: of 40 cores only 1 is used. When the parameter gpus=2, the following error occurs.
Traceback (most recent call last):
File "/home/mingjian/pythoncode/GNNPyg/test/testskorch.py", line 51, in <module>
trainer.fit(model, train_loader)
File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 988, in fit
results = self.__run_ddp_spawn(model, nprocs=self.num_processes)
File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1068, in __run_ddp_spawn
mp.spawn(self.ddp_train, nprocs=nprocs, args=(q, model, ))
File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
process.start()
File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/home/mingjian/anaconda3/envs/python36/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle typing.Union[torch.Tensor, NoneType]: it's not the same object as typing.Union
Code
trainer = Trainer(gpus=2, row_log_interval=10, max_epochs=100)
trainer.fit(model, train_loader)
What have you tried?
What's your environment?