trainer.fit() stuck and cannot interrupt kernel #5947

ifsheldon · 2021-02-12T22:27:35Z

Hi! I am migrating from plain PyTorch to pytorch-lightning. When I tried a simple training run with existing models, trainer.fit() got stuck before the GPUs even started running.

By "stuck" I mean that I waited for 5 minutes and nothing seemed to be running: htop and nvidia-smi showed that the CPUs and GPUs were idle.

My code is a one-pager, shown below:

import torch
import torchvision
from torchvision import transforms
from torchvision import models
import utils
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

baseline_imagenet_dataset = torchvision.datasets.ImageNet(root = "../../datasets/ImageNet/train/", 
                                                 split="train",
                                                 transform = transforms.Compose([
                                                     transforms.Resize(256),
                                                     transforms.CenterCrop(224),
                                                     transforms.ToTensor(),
                                                     transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225])
                                                 ])
                                                )

baseline_imagenet_dataset_val = torchvision.datasets.ImageNet(root = "../../datasets/ImageNet/val/", 
                                                 split="val",
                                                 transform = transforms.Compose([
                                                     transforms.Resize(256),
                                                     transforms.CenterCrop(224),
                                                     transforms.ToTensor(),
                                                     transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225])
                                                 ])
                                                )

baseline_imagenet_loader = torch.utils.data.DataLoader(baseline_imagenet_dataset,
                                              shuffle= True,
                                              batch_size = 1024)

baseline_imagenet_loader_val = torch.utils.data.DataLoader(baseline_imagenet_dataset_val,
                                              shuffle= False,
                                              batch_size = 512)

class NetWrapper(pl.LightningModule):
    def __init__(self, model, criterion = torch.nn.CrossEntropyLoss()):
        super().__init__()
        self.model = model
        self.criterion = criterion
        self.lr = 1e-3
    
    def forward(self, x):
        raw_prob = self.model(x)
        return raw_prob
    
    def training_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)
        loss = self.criterion(preds, y)
#         self.log("cross_entropy_loss_training", loss)
        return loss
    
    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x) 
        loss = self.criterion(preds, y)
#         self.log("cross_entropy_loss_val", loss)
        return loss
    
    def validation_epoch_end(self, validation_step_outputs):
        # Each validation_step returns a 0-dim loss tensor; torch.cat rejects those, so stack them.
        all_outputs = torch.stack(validation_step_outputs)
        std, mean = torch.std_mean(all_outputs)
#         self.log("validation_mean", mean)
#         self.log("validation_std", std)
        
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)
        return optimizer

resnet50 = models.resnet50(pretrained=False)
resnet50_pl_module = NetWrapper(resnet50)
trainer = pl.Trainer(gpus=4, accelerator='ddp')
trainer.fit(resnet50_pl_module, baseline_imagenet_loader, baseline_imagenet_loader_val) # stuck here

I ran the code in JupyterLab, having requested 32 cores, 512 GB of memory and 4 V100s on a shared cluster. While the trainer was stuck, none of the GPUs were running and nvidia-smi showed no processes. I also could not interrupt the kernel, so the only thing I could do was restart it.

I have read the tutorials and the code looks fine to me, but I am not sure it is correct. Did I miss something?

Thank you!

awaelchli · 2021-02-13T20:20:05Z

You mention Jupyter Lab, did you run this in a cell?
DDP is not supported in Jupyter notebooks.
It needs to run as a script.
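
One way to keep working from a notebook while still using ddp is to move the training code into a script and launch that script as a separate process. The following is a minimal sketch under assumptions: the script name `train.py` and the helper `run_training_script` are hypothetical and not part of Lightning, and the script is expected to contain the model, the data loaders and the `trainer.fit(...)` call.

```python
import subprocess
import sys


def run_training_script(script_path, *script_args):
    """Run a Python script in a fresh interpreter and return its exit code."""
    # Use the same interpreter as the notebook kernel so the same
    # environment (and installed packages) is picked up.
    cmd = [sys.executable, script_path, *script_args]
    # check=False: report the exit code instead of raising on failure.
    completed = subprocess.run(cmd, check=False)
    return completed.returncode


# Hypothetical usage from a notebook cell:
# exit_code = run_training_script("train.py")
```

The training then runs outside the kernel, so the script is executed the same way it would be from a terminal with `python train.py`.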

ifsheldon Feb 13, 2021
Author

OK, I see. I just tried running as a script and it works! Thanks!

Could you please ask the team in charge of the documentation to state this limitation up front? No intention to blame anyone, but the tutorials gave me the impression that I could do this in a notebook, since some of the materials are provided as notebooks. Something like "you can use dp as the accelerator in a notebook to experiment with your idea, but ddp requires running as a script" seems reasonable.

Although I think I could open a PR for that, I don't know whether that would be appropriate or where the note should go.

ifsheldon Feb 13, 2021
Author

Also, can you please tell me if there's anything else that cannot be run in the notebook? Thanks again!

awaelchli Feb 14, 2021

We documented it here: https://pytorch-lightning.readthedocs.io/en/stable/multi_gpu.html#data-parallel
For Jupyter notebooks (including Colab and the like), you can use accelerator="ddp_spawn", but that is the default anyway when you set the gpus flag in the Trainer. In that sense, the defaults are a good starting point for most users. There are more advanced accelerators available (such as ddp and ddp sharded). I recommend reading the documentation before using them.
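
The rule of thumb from this thread could be sketched as a small helper that picks the accelerator string depending on where the code runs. This is only an illustration: `running_in_notebook` and `choose_accelerator` are hypothetical names, and the kernel check is a common IPython heuristic rather than a Lightning API. The two strings are the accelerator names mentioned above.

```python
def running_in_notebook() -> bool:
    """Heuristic: True when executed inside a Jupyter kernel."""
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so this cannot be a notebook.
        return False
    shell = get_ipython()
    # Jupyter kernels use ZMQInteractiveShell; plain scripts get None.
    return shell is not None and type(shell).__name__ == "ZMQInteractiveShell"


def choose_accelerator(in_notebook: bool) -> str:
    """Return "ddp_spawn" for notebooks and "ddp" for scripts."""
    return "ddp_spawn" if in_notebook else "ddp"


# Hypothetical usage with the Trainer from the original post:
# trainer = pl.Trainer(gpus=4, accelerator=choose_accelerator(running_in_notebook()))
```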

awaelchli Feb 14, 2021

> Also, can you please tell me if there's anything else that cannot be run in the notebook? Thanks again!

It is difficult to give an exhaustive list of what does not work. Any feature that needs the code to run as a script will not work in a notebook, namely DDP (e.g. with torchelastic or SLURM) and Horovod.

ifsheldon Feb 14, 2021
Author

Thank you! The reason I changed the default to ddp is that ddp_spawn is extremely slow, and I understood why after reading the page about multi-GPU training. It would be really nice if you could add a sentence about this to "Lightning in 2 Steps" or "How to organize PyTorch into Lightning", where the multi-GPU code appears. Something like "There are some limitations, please refer to Multi-GPU training" would suffice, I think. Because training just gets stuck and we cannot tell what is causing it, it would be great to mention this and link to the limitations early in the tutorials.
