[Grid] You must call wandb.init() before wandb.log() #7028

Closed
turian opened this issue Apr 15, 2021 · 8 comments
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@turian
Contributor

turian commented Apr 15, 2021

🐛 Bug

I'm reopening #1356 because I'm getting this error running my code on grid.ai.

I am getting this error:

wandb.errors.error.Error: You must call wandb.init() before wandb.log()

Please reproduce using the BoringModel

Not possible since colab has only one GPU, unlike grid.ai

To Reproduce

On grid.ai or any multi-GPU machine, create a trainer with a WandbLogger and do not specify an accelerator. Run with gpus=-1 and you will hit this error.

Despite #2029 the default is ddp_spawn, which triggers this error on grid.ai:

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.

Workaround:

  1. In main, run
import wandb
wandb.init(project...)

(This seems redundant and potentially dangerous/foot-gunny, since you are already passing a WandbLogger to the trainer.)

  2. Make sure the trainer has accelerator="ddp" defined.
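
The second workaround step can be sketched as a small helper that always names the backend explicitly, so the ddp_spawn default is never selected. This is only an illustration: `trainer_kwargs` is a hypothetical name, not part of Lightning, and it only builds the keyword arguments that would be passed to `Trainer`.

```python
from typing import Any, Dict


def trainer_kwargs(gpus: int) -> Dict[str, Any]:
    """Build Trainer keyword arguments with an explicit multi-GPU backend.

    Hypothetical helper: when more than one GPU may be used (gpus=-1 means
    "all available"), set accelerator="ddp" so that Lightning does not fall
    back to its ddp_spawn default.
    """
    kwargs: Dict[str, Any] = {"gpus": gpus}
    if gpus == -1 or gpus > 1:
        kwargs["accelerator"] = "ddp"
    return kwargs


print(trainer_kwargs(-1))
```

It would be used as `Trainer(logger=logger, **trainer_kwargs(-1))`, leaving single-GPU and CPU runs untouched.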

Expected behavior

The wandb logger works when the trainer is given a WandbLogger and gpus=-1 with no accelerator defined, and no duplicate wandb.init() call is needed.

Environment

grid.ai

  • CUDA:
    - GPU:
    - Tesla M60
    - Tesla M60
    - available: True
    - version: 10.2
  • Packages:
    - numpy: 1.20.2
    - pyTorch_debug: False
    - pyTorch_version: 1.8.1+cu102
    - pytorch-lightning: 1.2.7
    - tqdm: 4.60.0
  • System:
    - OS: Linux
    - architecture: 64bit
    - processor: x86_64
    - python: 3.7.10
    - version: #1 SMP Tue Mar 16 04:56:19 UTC 2021
@turian turian added the bug and help wanted labels Apr 15, 2021
@SeanNaren SeanNaren changed the title from "grid.ai Error: You must call wandb.init() before wandb.log()" to "[Grid] You must call wandb.init() before wandb.log()" Apr 15, 2021
@awaelchli
Contributor

I ran this in an interactive session on grid with Lightning 1.2.7:

import os
import torch
from torch.utils.data import Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self, my_param: int = 2):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return {"x": loss}

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return {"y": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)

    logger = WandbLogger(project="myproject")

    model = BoringModel()
    trainer = Trainer(
        gpus=-1,
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        logger=logger,
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    run()

gridai@ixsession → python repro.py
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. `Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
wandb: Currently logged in as: awaelchli (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.26
wandb: Syncing run driven-darkness-4
wandb: ⭐️ View project at https://wandb.ai/awaelchli/myproject
wandb: 🚀 View run at https://wandb.ai/awaelchli/myproject/runs/9tigayd9
wandb: Run data is saved locally in /home/jovyan/wandb/run-20210417_133841-9tigayd9
wandb: Run `wandb offline` to turn off syncing.

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: MASTER_ADDR environment variable is not defined. Set as localhost
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: You are using `accelerator=ddp_spawn` with num_workers=0. For much faster performance, switch to `accelerator=ddp` and set `num_workers>0`
  warnings.warn(*args, **kwargs)
Epoch 0:   0%|          | 0/2 [00:00<?, ?it/s]
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0: 100%|██████████| 2/2 [00:00<00:00, 131.16it/s, loss=-0.0434, v_num=ayd9]
wandb: Currently logged in as: awaelchli (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.26
wandb: Resuming run driven-darkness-4
wandb: ⭐️ View project at https://wandb.ai/awaelchli/myproject
wandb: 🚀 View run at https://wandb.ai/awaelchli/myproject/runs/9tigayd9
wandb: Run data is saved locally in /home/jovyan/wandb/run-20210417_133841-9tigayd9/files/wandb/run-20210417_133848-9tigayd9
wandb: Run `wandb offline` to turn off syncing.

Epoch 0: 100%|██████████| 2/2 [00:01<00:00,  1.47it/s, loss=-0.0434, v_num=ayd9]
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: cleaning up ddp environment...
  warnings.warn(*args, **kwargs)

I don't get the error message you are mentioning. Any hints as to what I need to modify?

@turian
Contributor Author

turian commented Apr 17, 2021

Here is an example trying to log images or audio to wandb that breaks.

The following works (one GPU). Make sure to pip3 install soundfile first:

import os
import torch
from torch.utils.data import Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.loggers import WandbLogger
import wandb

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self, my_param: int = 2):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        wandb.log({"examples": [wandb.Audio(torch.rand(32).cpu().numpy(), caption="Nice", sample_rate=32)]})
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)
        return {"x": loss}

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)
        return {"y": loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=0)

    logger = WandbLogger(project="myproject")

    model = BoringModel()
    trainer = Trainer(
#        gpus=-1,
        gpus=1,
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        logger=logger,
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    trainer.test(model, test_dataloaders=test_data)


if __name__ == '__main__':
    wandb.init()
    run()

If you switch to multiple GPUs, it breaks with:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
<ipython-input-4-b4487cb8ccc5> in <module>
     73 if __name__ == '__main__':
     74     wandb.init()
---> 75     run()

<ipython-input-4-b4487cb8ccc5> in run()
     67         logger=logger,
     68     )
---> 69     trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
     70     trainer.test(model, test_dataloaders=test_data)
     71 

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    497 
    498         # dispath `start_training` or `start_testing` or `start_predicting`
--> 499         self.dispatch()
    500 
    501         # plugin will finalized fitting (e.g. ddp_spawn will load trained model)

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in dispatch(self)
    544 
    545         else:
--> 546             self.accelerator.start_training(self)
    547 
    548     def train_or_test_or_predict(self):

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py in start_training(self, trainer)
     71 
     72     def start_training(self, trainer):
---> 73         self.training_type_plugin.start_training(trainer)
     74 
     75     def start_testing(self, trainer):

/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py in start_training(self, trainer)
    106 
    107     def start_training(self, trainer):
--> 108         mp.spawn(self.new_process, **self.mp_spawn_kwargs)
    109         # reset optimizers, since main process is never used for training and thus does not have a valid optim state
    110         trainer.optimizers = []

/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
    228                ' torch.multiprocessing.start_process(...)' % start_method)
    229         warnings.warn(msg)
--> 230     return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')

/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    186 
    187     # Loop on join until it returns True or raises an exception.
--> 188     while not context.join():
    189         pass
    190 

/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    142                     error_index=error_index,
    143                     error_pid=failed_process.pid,
--> 144                     exit_code=exitcode
    145                 )
    146 

ProcessExitedException: process 0 terminated with exit code 1

If you switch to self.log, you get:

TypeError: log() missing 1 required positional argument: 'value'

Basically, I want to log images, audio, and matplotlib figures to wandb from within DDP.
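
A likely explanation for the multi-GPU failure, though not one confirmed in this thread, is that ddp_spawn starts each worker as a fresh Python process, so a `wandb.init()` call made under `if __name__ == '__main__':` in the launching process never runs in the workers. The sketch below reproduces that effect with the standard library only. `TRACKER` is a hypothetical stand-in for a library that keeps its active run in module-level state; it is not wandb's real implementation.

```python
import subprocess
import sys

# Hypothetical stand-in for a tracking library that keeps its active run in
# module-level state. This is NOT wandb's real implementation.
TRACKER = '''
run = None

def init():
    global run
    run = "active-run"

def log(payload):
    if run is None:
        raise RuntimeError("You must call init() before log()")
'''


def run_worker(body: str) -> subprocess.CompletedProcess:
    """Execute the tracker plus `body` in a brand-new Python interpreter."""
    return subprocess.run(
        [sys.executable, "-c", TRACKER + body],
        capture_output=True,
        text=True,
    )


# A fresh interpreter has never seen any init() call made by its launcher.
worker_without_init = run_worker("log({'loss': 0.1})\n")

# Initializing inside the worker itself makes log() succeed.
worker_with_init = run_worker("init()\nlog({'loss': 0.1})\n")

print(worker_without_init.returncode, worker_with_init.returncode)
```

The first worker exits with a non-zero code and the "must call init()" message on stderr, while the second exits cleanly.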

@awaelchli
Contributor

awaelchli commented Apr 17, 2021

Thanks. I tried this and can see where the problem is.
Do the following:

  1. Remove the manual wandb.init call at the bottom
  2. Replace wandb.log({"examples": ... }) with self.logger.experiment.log(...)

This should work :) I can see the audio samples in the wandb run online. They don't play, but I think that's because this dummy sample is too short.

Furthermore, we currently don't support images, audio, etc. in self.log(), since the API depends on the specific logger. There are efforts to standardize this in #6720.
So for these custom objects, you have to call self.logger.experiment.log (which is basically the same as wandb.log).
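
One way to apply this split inside a training step is to separate plain numbers from everything else before logging. This is a minimal sketch: `split_payload` is a hypothetical helper, and the check covers only plain Python numbers (real code would probably want to treat tensors as scalars too).

```python
from numbers import Number
from typing import Any, Dict, Tuple


def split_payload(payload: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate scalar metrics from logger-specific objects.

    Scalars are meant for self.log(name, value); everything else (audio,
    images, figures) is meant for self.logger.experiment.log(...).
    """
    scalars = {k: v for k, v in payload.items() if isinstance(v, Number)}
    media = {k: v for k, v in payload.items() if k not in scalars}
    return scalars, media


scalars, media = split_payload({"train_loss": 0.5, "examples": ["audio-clip"]})
print(scalars, media)
```

Inside a LightningModule, each entry of `scalars` would go through `self.log(name, value)` and, if `media` is non-empty, the whole dict would go through `self.logger.experiment.log(media)`.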

EDIT: I tried your code with DDP as well. The fix above applies.
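
A plausible reason the fix carries over to DDP is that the logger creates its run lazily, the first time `experiment` is accessed in a given process, so no separate init call is needed. That is an assumption about the logger's behaviour rather than something stated in this thread. The sketch below shows the pattern with hypothetical stand-ins (`StubRun`, `StubLogger`, `log_media`), none of which are Lightning or wandb classes.

```python
from typing import Any, Dict, List, Optional


class StubRun:
    """Hypothetical stand-in for the run object behind `logger.experiment`."""

    def __init__(self) -> None:
        self.logged: List[Dict[str, Any]] = []

    def log(self, payload: Dict[str, Any]) -> None:
        self.logged.append(dict(payload))


class StubLogger:
    """Hypothetical logger that creates its run on first access."""

    def __init__(self) -> None:
        self._experiment: Optional[StubRun] = None  # nothing created yet

    @property
    def experiment(self) -> StubRun:
        # Lazy initialization: the run is created by whichever process
        # touches `experiment` first, so no manual init call is required.
        if self._experiment is None:
            self._experiment = StubRun()
        return self._experiment


def log_media(logger: StubLogger, payload: Dict[str, Any]) -> None:
    """Route logger-specific objects through the logger, not a global call."""
    logger.experiment.log(payload)


logger = StubLogger()
log_media(logger, {"examples": ["audio-clip"]})
print(logger.experiment.logged)
```

In a training step the equivalent call would be `self.logger.experiment.log({"examples": [...]})`, replacing the direct `wandb.log(...)` call.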

@turian
Contributor Author

turian commented Apr 17, 2021

@awaelchli thanks, I will try it. Is this documented somewhere?

@awaelchli
Contributor

We have a small section here:
https://pytorch-lightning.readthedocs.io/en/latest/extensions/logging.html#manual-logging
Open to suggestions if it needs improvement.

@turian
Contributor Author

turian commented Apr 18, 2021

I see. Thanks.

I'm not exactly sure how to make it clearer, but the headline "Manual Logging" is maybe a bit off-base for me. Perhaps "Manual Logging to a Supported or Custom Logger"?

@yuelei0428

I encountered the same issue and found that this can be simply fixed by moving wandb.init to the first line in your main function.

@TaosLezz

You can try:
import wandb
wandb.init(mode='disabled')
