WandbLogger cannot be used with 'ddp' #981
Hi! Thanks for your contribution, great first issue!
Some hacky solutions: calling

```python
def init_ddp_connection(self, *args, **kwargs):
    super().init_ddp_connection(*args, **kwargs)
    if torch.distributed.get_rank() == 0:
        import wandb
        wandb.run = None
```

These both seem to only kind-of work, and result in multiple independent calls to `wandb.init`. I think the ideal solution is that the experiment is only ever initialized on rank zero. However, doing this means that wandb cannot be initialized in the master thread at all. Anything better than this probably requires some changes to the wandb API.
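The rank-zero-only idea can be sketched without wandb at all: a logger whose experiment is created lazily and only on rank zero. All names below are hypothetical stand-ins, not the real `WandbLogger` API:

```python
class LazyExperiment:
    """Toy logger: the experiment object is created lazily and only on
    rank zero. In real DDP the rank would come from
    torch.distributed.get_rank(); here it is passed in explicitly."""

    def __init__(self, rank):
        self._rank = rank
        self._experiment = None

    @property
    def experiment(self):
        # Only rank zero ever materializes the experiment object;
        # other ranks get None and simply skip logging.
        if self._experiment is None and self._rank == 0:
            self._experiment = object()  # would be wandb.init(...) for real
        return self._experiment

    def log(self, metrics):
        if self.experiment is not None:
            pass  # would forward to the real experiment's log method

rank0 = LazyExperiment(rank=0)
rank1 = LazyExperiment(rank=1)
print(rank0.experiment is not None)  # True
print(rank1.experiment is None)      # True
```

The key design point is that no code path touches `wandb.init` unless it first checks the rank, so the master process never creates a run by accident.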
Following up slightly: my hacky solution doesn't really work. It's easy enough, though, to get the rank-zero-only solution working. If this seems like a reasonable approach, let me know and I'll submit a PR.
Well, I have observed some issues with pickling on some platforms.
Hmm, I think this is a slightly different issue (I'm running on Ubuntu, so I don't think that's the problem). Pickling also works correctly.
This particular problem, I think, stems from this part of wandb's `init`:

```python
def init(...):
    ...
    # If a thread calls wandb.init() it will get the same Run object as
    # the parent. If a child process with distinct memory space calls
    # wandb.init(), it won't get an error, but it will get a result of
    # None.
    # This check ensures that a child process can safely call wandb.init()
    # after a parent has (only the parent will create the Run object).
    # This doesn't protect against the case where the parent doesn't call
    # wandb.init but two children do.
    if run or os.getenv(env.INITED):
        return run
```

Child processes end up getting `None` back.
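The failure mode described above can be simulated in a few lines. This is a simplified sketch of the guard, not wandb's actual code; the singleton, env-var name, and function are all stand-ins:

```python
import os

run = None  # module-level singleton, analogous to wandb.run

def fake_init():
    """Mimics the quoted guard: if a run already exists, or the
    'already inited' env var is set, return whatever `run` currently
    holds. A forked child inherits the env var but not the Run object,
    so the guard fires and it gets None."""
    global run
    if run or os.getenv("FAKE_WANDB_INITED"):
        return run
    run = "Run"  # the parent creates the real Run object here
    os.environ["FAKE_WANDB_INITED"] = "1"
    return run

parent_run = fake_init()   # parent process: creates the run
run = None                 # simulate a child: fresh memory space...
child_run = fake_init()    # ...but the inherited env var short-circuits init
print(parent_run)          # Run
print(child_run)           # None
```

This is exactly the situation rank zero hits under ddp: the env var says "already initialized", but the in-memory run object never existed in the child.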
Right now, this is the only part of the logging code that the parent thread calls (I assume it's called when pickling):

```python
def __getstate__(self):
    state = self.__dict__.copy()
    # cannot be pickled
    state['_experiment'] = None
    # args needed to reload correct experiment
    state['_id'] = self.experiment.id
    return state
```

If this is changed to:

```python
def __getstate__(self):
    state = self.__dict__.copy()
    # args needed to reload correct experiment
    if self._experiment is not None:
        state['_id'] = self._experiment.id
    else:
        state['_id'] = None
    # cannot be pickled
    state['_experiment'] = None
    return state
```

that will ensure that, unless the user explicitly logs something or creates the wandb experiment first, the main thread will not try to create an experiment. Since subsequent logging / saving code is wrapped by the `rank_zero_only` decorator, only rank zero should create one.

It's also possible that these properties are also called by master. Ideally they would be wrapped to not create the experiment unless it had already been created (i.e. the experiment should only be created by a function that is wrapped with the `rank_zero_only` decorator).
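The difference between the two versions can be demonstrated with a toy logger. The class below is illustrative, not the real `WandbLogger`; it only mirrors the fixed `__getstate__` to show that pickling alone no longer triggers experiment creation:

```python
import pickle

class Run:
    id = "abc123"

class ToyLogger:
    """Minimal logger mirroring the fixed __getstate__ above."""

    def __init__(self):
        self._experiment = None  # created lazily
        self._id = None

    @property
    def experiment(self):
        if self._experiment is None:
            self._experiment = Run()  # stand-in for wandb.init(...)
        return self._experiment

    def __getstate__(self):
        state = self.__dict__.copy()
        # args needed to reload the correct experiment
        if self._experiment is not None:
            state['_id'] = self._experiment.id
        else:
            state['_id'] = None
        # the live experiment cannot be pickled
        state['_experiment'] = None
        return state

fresh = ToyLogger()
restored = pickle.loads(pickle.dumps(fresh))
print(restored._id)  # None: pickling alone did not create an experiment

used = ToyLogger()
used.experiment  # the user touched the experiment first
print(pickle.loads(pickle.dumps(used))._id)  # abc123
```

With the original version, the `self.experiment.id` access inside `__getstate__` would have forced experiment creation in the parent as a side effect of pickling.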
```python
wandb.init(..., reinit=dist.is_available() and dist.is_initialized() and dist.get_rank() == 0)
```

should force a re-initialization when wandb is already initialized for rank zero.
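Factoring that flag out makes the short-circuit behavior explicit. This is a hypothetical helper, shown only to illustrate why all three checks are needed (the earlier ones guard single-process runs where `torch.distributed` was never initialized):

```python
def rank_zero_reinit(dist_available, dist_initialized, rank):
    """Return True only when distributed training is actually running
    and this process is rank zero. Hypothetical helper, not wandb API."""
    return dist_available and dist_initialized and rank == 0

print(rank_zero_reinit(True, True, 0))    # True: rank zero re-initializes
print(rank_zero_reinit(True, True, 1))    # False on other ranks
print(rank_zero_reinit(False, False, 0))  # False in single-process runs
```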
@rmrao I just made a PR. It should be safe to always set `reinit=True`. I wasn't able to reproduce the specific environment you mentioned. Can you share some code or a Colab with me to be sure this works OK?
I think I had tried setting `reinit` without fixing the other calls to avoid creating the experiment unless necessary. I don't think this will produce errors. I haven't experimented with `reinit` significantly, and I'm not sure what it does in terms of creating a new run in wandb, etc., but I don't think I have a setup that explicitly causes an error with `reinit`.
It would also be good to add a test for the logger in ddp mode.
Has there been any progress on this? What is the fix to enable WandbLogger on 'ddp'? |
Is there any update on this? I'm also hoping to use `WandbLogger` with ddp.
Can someone provide a small script to reproduce this? We can work on a fix and include it in the next release. |
@julianmack try 0.7.3 |
I found a similar bug with the WandB logger where the checkpointing callback expects the logger's `name` property. However, WandB will set the `name` to something else at runtime.
I worked around this issue by adding the following line to the top of my training job:

```python
setattr(WandbLogger, 'name', property(lambda self: self._name))
```
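The mechanics of that one-liner can be shown on a stand-in class. `FakeWandbLogger` below is illustrative only; it just reproduces the pattern of rebinding a class-level property so it returns the user-supplied name stored on the instance:

```python
class FakeWandbLogger:
    """Stand-in for WandbLogger: `name` normally comes from a property
    whose value the checkpoint callback chokes on."""

    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        return "wandb-generated-name"  # the unwanted runtime value

# The workaround: rebind the class-level `name` property so attribute
# lookup returns the stored user-supplied name instead.
setattr(FakeWandbLogger, 'name', property(lambda self: self._name))

logger = FakeWandbLogger(name="my-experiment")
print(logger.name)  # my-experiment
```

Because properties live on the class, `setattr` on the class replaces the descriptor for every instance, which is why a single line at the top of the training job is enough.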
Just to clarify @parasj's solution, presumably you also have to pass in some Unfortunately this means you're now responsible for generating unique names for subsequent runs. |
I am getting a similar issue when using wandb with DDP and the Hugging Face implementation of BERT. I am running on one node with multiple GPUs. When I switched to using torch.distributed, I started getting the error below. I am not sure if this is related to what I see in this thread.

```
File "/home/kelmaghr/benchmarks/transformers/src/transformers/trainer_callback.py", line 387, in call_event
    wandb.log({})
File "/home/kelmaghr/anaconda3/envs/transformers_env/lib/python3.7/site-packages/wandb/sdk/lib/preinit.py", line 38, in preinit_wrapper
```
I have the same error, trying to get WandBLogger working on grid.ai |
It is solved in #13166.
🐛 Bug
wandb modifies `init` such that a child process calling `init` returns `None` if the master process has called `init`. This seems to cause a bug with ddp, and results in rank zero having `experiment = None`, which crashes the program.

To Reproduce
Can be reproduced with the basic MNIST gpu template, simply add a WandbLogger and pass 'ddp' as the distributed backend.
This occurs with the latest wandb version and with pytorch-lightning 0.6.