[Grid] You must call wandb.init() before wandb.log() #7028
Comments
I ran this in an interactive session on grid with lightning 1.2.7:

```python
import os
import torch
from torch.utils.data import Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self, my_param: int = 2):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return {"x": loss}

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return {"y": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)

    logger = WandbLogger(project="myproject")
    model = BoringModel()
    trainer = Trainer(
        gpus=-1,
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        logger=logger,
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    run()
```

```
gridai@ixsession → python repro.py
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
wandb: Currently logged in as: awaelchli (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.26
wandb: Syncing run driven-darkness-4
wandb: ⭐️ View project at https://wandb.ai/awaelchli/myproject
wandb: 🚀 View run at https://wandb.ai/awaelchli/myproject/runs/9tigayd9
wandb: Run data is saved locally in /home/jovyan/wandb/run-20210417_133841-9tigayd9
wandb: Run `wandb offline` to turn off syncing.
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: MASTER_ADDR environment variable is not defined. Set as localhost
warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You are using `accelerator=ddp_spawn` with num_workers=0. For much faster performance, switch to `accelerator=ddp` and set `num_workers>0`
warnings.warn(*args, **kwargs)
Epoch 0: 0%| | 0/2 [00:00<?, ?it/s][W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 131.16it/s, loss=-0.0434, v_num=ayd9]
wandb: Currently logged in as: awaelchli (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.26
wandb: Resuming run driven-darkness-4
wandb: ⭐️ View project at https://wandb.ai/awaelchli/myproject
wandb: 🚀 View run at https://wandb.ai/awaelchli/myproject/runs/9tigayd9
wandb: Run data is saved locally in /home/jovyan/wandb/run-20210417_133841-9tigayd9/files/wandb/run-20210417_133848-9tigayd9
wandb: Run `wandb offline` to turn off syncing.
Epoch 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.47it/s, loss=-0.0434, v_num=ayd9]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: cleaning up ddp environment...
warnings.warn(*args, **kwargs)
```

I don't get the error message you are mentioning. Any hints as to what I need to modify?
Here is an example that breaks when trying to log images or audio to wandb. It works on a single GPU, but if you switch to multiple GPUs it breaks with the error in the title. Basically I want to log images + audio + matplotlib figures to wandb from within DDP.
Thanks. I tried this and can see where the problem is.
This should work :) I can see the audio samples in the wandb run online. It doesn't play, but I think that's because this dummy sample is too short. Furthermore, we currently don't support images, audio, etc. in self.log(), since the API depends on the specific logger. There are efforts to standardize this in #6720. EDIT: I tried your code with DDP as well; the fix above applies.
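For reference, here is a minimal sketch of what manual media logging through the logger can look like (my reconstruction, not the exact snippet posted in this thread; the rank-zero guard and the reshaped dummy image are illustrative):

```python
import torch
import wandb
from pytorch_lightning import LightningModule


class MediaLoggingModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)          # scalars go through self.log as usual
        if self.trainer.is_global_zero:       # only rank 0 owns the wandb run
            # Rich media are logged directly on the underlying wandb run object.
            image = wandb.Image(batch[0].view(4, 8).cpu().numpy())
            self.logger.experiment.log({"examples": [image]})
        return {"x": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)
```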
@awaelchli thanks, I will try it. Is this documented somewhere?

We have a small section here.

I see, thanks. I'm not exactly sure how to make it clearer, but the headline "Manual Logging" is maybe a bit off-base for me. "Manual Logging to a Supported or Custom Logger"?
I encountered the same issue and found that it can simply be fixed by moving wandb.init() to the first line of your main function.
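A sketch of that fix, reusing the `myproject` name from the repro above (my reconstruction, not the commenter's actual code):

```python
import wandb
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger


def main():
    # Create the wandb run before any Lightning objects exist, so the main
    # process already has an active run when logging starts.
    wandb.init(project="myproject")

    logger = WandbLogger(project="myproject")
    trainer = Trainer(gpus=-1, max_epochs=1, logger=logger)
    # ... build the model and dataloaders and call trainer.fit(...) as in the repro above


if __name__ == "__main__":
    main()
```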
You can try:
🐛 Bug
I'm reopening #1356 because I'm getting this error running my code on grid.ai.
I am getting this error: `You must call wandb.init() before wandb.log()`
Please reproduce using the BoringModel
Not possible, since Colab has only one GPU, unlike grid.ai.
To Reproduce
On grid.ai or any multi-GPU machine, create a trainer with a WandbLogger and do not specify an accelerator. Run with gpus=-1 and you hit this error.
Despite #2029, the default is ddp_spawn, which triggers this error on grid.ai: `You must call wandb.init() before wandb.log()`.
Workaround: call wandb.init() manually before creating the Trainer (this seems redundant and potentially dangerous/foot-gunny, since you are already passing a WandbLogger to the trainer).
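A sketch of that workaround (reconstructed from the description above, not the original snippet; handing the pre-created run to the logger via its `experiment` argument is my addition and only one possible way to wire it up):

```python
import wandb
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Manually create the run first (the "duplicate init" the workaround relies on),
# then hand it to the logger so both refer to the same wandb run.
run = wandb.init(project="myproject")
logger = WandbLogger(experiment=run)
trainer = Trainer(gpus=-1, logger=logger)
```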
Expected behavior
The wandb logger should work when the trainer is given a WandbLogger, gpus=-1, and no accelerator, without a duplicate wandb.init() call being needed.
Environment
grid.ai
- CUDA:
  - GPU:
    - Tesla M60
    - Tesla M60
  - available: True
  - version: 10.2
- numpy: 1.20.2
- pyTorch_debug: False
- pyTorch_version: 1.8.1+cu102
- pytorch-lightning: 1.2.7
- tqdm: 4.60.0
- OS: Linux
- architecture:
  - 64bit
- processor: x86_64
- python: 3.7.10
- version: #1 SMP Tue Mar 16 04:56:19 UTC 2021