test_step hangs after one iteration when on multiple GPUs #3730

Closed
vegovs opened this issue Sep 29, 2020 · 6 comments · Fixed by #3819
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on)

Comments

@vegovs commented Sep 29, 2020

🐛 Bug

When running the same code on a machine with one GPU, test_step runs as normal and logs what it should.
However, on a node with 4 GPUs, it hangs after one iteration!

Code sample

    def test_step(self, batch, batch_idx):
        images, masks = batch["image"], batch["mask"]
        if images.shape[1] != self.hparams.n_channels:
            raise AssertionError(
                f"Network has been defined with {self.n_channels} input channels, "
                f"but loaded images have {images.shape[1]} channels. Please check that "
                "the images are loaded correctly."
            )

        masks = (
            masks.type(torch.float32)
            if self.hparams.n_classes == 1
            else masks.type(torch.long)
        )

        masks_pred = self(images)  # Forward pass
        loss = self.loss_function(masks_pred, masks)
        result = pl.EvalResult(loss, checkpoint_on=loss)
        result.log("test_loss", loss, on_step=True, on_epoch=True, sync_dist=True)
        rand_idx = randint(0, self.hparams.batch_size - 1)  # may index past the end of a short final batch
        onehot = torch.sigmoid(masks_pred[rand_idx]) > 0.5
        for tag, value in self.named_parameters():
            tag = tag.replace(".", "/")
            self.logger.experiment.add_histogram(tag, value, self.current_epoch)
        mask_grid = torchvision.utils.make_grid([masks[rand_idx], onehot], nrow=2)
        self.logger.experiment.add_image(
            "TEST - Target vs Predicted", mask_grid, self.current_epoch
        )
        alpha = 0.5
        image_grid = torchvision.utils.make_grid(
            [
                images[rand_idx],
                torch.clamp(
                    kornia.enhance.add_weighted(
                        src1=images[rand_idx],
                        alpha=1.0,
                        src2=onehot,
                        beta=alpha,
                        gamma=0.0,
                    ),
                    max=1.0,
                ),
            ]
        )
        self.logger.experiment.add_image(
            "TEST - Image vs Predicted", image_grid, self.current_epoch
        )
        pred = (torch.sigmoid(masks_pred) > 0.5).float()
        f1 = f1_score(pred, masks, self.hparams.n_classes + 1)
        rec = recall(pred, masks, self.hparams.n_classes + 1)
        pres = precision(pred, masks, self.hparams.n_classes + 1)
        result.log("test_f1", f1, on_epoch=True)
        result.log("test_recall", rec, on_epoch=True)
        result.log("test_precision", pres, on_epoch=True)

        return result

Expected behavior

I expect it to finish the test epoch.

Environment

Environment 1

  • CUDA:
    • GPU:
      • GeForce RTX 2070 SUPER
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.2
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0
    • pytorch-lightning: 0.9.0
    • tqdm: 4.49.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.6.9
    • version: #52~18.04.1-Ubuntu SMP Thu Sep 10 12:50:22 UTC 2020

Environment 2

  • CUDA:
    • GPU:
      • GeForce RTX 2080 Ti
      • GeForce RTX 2080 Ti
      • GeForce RTX 2080 Ti
      • GeForce RTX 2080 Ti
    • available: True
    • version: 10.2
  • Packages:
    • numpy: 1.19.1
    • pyTorch_debug: False
    • pyTorch_version: 1.6.0
    • pytorch-lightning: 0.9.0
    • tqdm: 4.49.0
  • System:
@awaelchli (Contributor)

Can you post which trainer settings you are using?

@vegovs (Author) commented Sep 30, 2020

    def training_step(self, batch, batch_idx):
        images, masks = batch["image"], batch["mask"]
        if images.shape[1] != self.hparams.n_channels:
            raise AssertionError(
                f"Network has been defined with {self.hparams.n_channels} input channels, "
                f"but loaded images have {images.shape[1]} channels. Please check that "
                "the images are loaded correctly."
            )

        masks = (
            masks.type(torch.float32)
            if self.hparams.n_classes == 1
            else masks.type(torch.long)
        )

        masks_pred = self(images)  # Forward pass
        loss = self.loss_function(masks_pred, masks)
        result = pl.TrainResult(minimize=loss)
        result.log("train_loss", loss, sync_dist=True)
        if batch_idx == 0:
            self.logg_images(images, masks, masks_pred, "TRAIN")
        pred = (torch.sigmoid(masks_pred) > 0.5).float()
        f1 = f1_score(pred, masks, self.hparams.n_classes + 1)
        rec = recall(pred, masks, self.hparams.n_classes + 1)
        pres = precision(pred, masks, self.hparams.n_classes + 1)
        result.log("train_f1", f1, on_epoch=True)
        result.log("train_recall", rec, on_epoch=True)
        result.log("train_precision", pres, on_epoch=True)

        return result

    def validation_step(self, batch, batch_idx):
        images, masks = batch["image"], batch["mask"]
        if images.shape[1] != self.hparams.n_channels:
            raise AssertionError(
                f"Network has been defined with {self.n_channels} input channels, "
                f"but loaded images have {images.shape[1]} channels. Please check that "
                "the images are loaded correctly."
            )

        masks = (
            masks.type(torch.float32)
            if self.hparams.n_classes == 1
            else masks.type(torch.long)
        )

        masks_pred = self(images)  # Forward pass
        loss = self.loss_function(masks_pred, masks)
        result = pl.EvalResult(loss, checkpoint_on=loss)
        result.log("val_loss", loss, sync_dist=True)
        if batch_idx == 0:
            self.logg_images(images, masks, masks_pred, "VAL")
        pred = (torch.sigmoid(masks_pred) > 0.5).float()
        f1 = f1_score(pred, masks, self.hparams.n_classes + 1)
        rec = recall(pred, masks, self.hparams.n_classes + 1)
        pres = precision(pred, masks, self.hparams.n_classes + 1)
        result.log("val_f1", f1, on_epoch=True)
        result.log("val_recall", rec, on_epoch=True)
        result.log("val_precision", pres, on_epoch=True)

        return result

    def test_step(self, batch, batch_idx):
        images, masks = batch["image"], batch["mask"]
        if images.shape[1] != self.hparams.n_channels:
            raise AssertionError(
                f"Network has been defined with {self.n_channels} input channels, "
                f"but loaded images have {images.shape[1]} channels. Please check that "
                "the images are loaded correctly."
            )

        masks = (
            masks.type(torch.float32)
            if self.hparams.n_classes == 1
            else masks.type(torch.long)
        )

        masks_pred = self(images)  # Forward pass
        loss = self.loss_function(masks_pred, masks)
        result = pl.EvalResult(loss, checkpoint_on=loss)
        result.log("test_loss", loss, on_step=True, on_epoch=True, sync_dist=True)
        self.logg_images(images, masks, masks_pred, "TEST")
        pred = (torch.sigmoid(masks_pred) > 0.5).float()
        f1 = f1_score(pred, masks, self.hparams.n_classes + 1)
        rec = recall(pred, masks, self.hparams.n_classes + 1)
        pres = precision(pred, masks, self.hparams.n_classes + 1)
        result.log("test_f1", f1, on_epoch=True)
        result.log("test_recall", rec, on_epoch=True)
        result.log("test_precision", pres, on_epoch=True)

        return result

@awaelchli (Contributor)

What arguments do you pass to Trainer(...)?
Do you use distributed_backend="ddp"?

@vegovs (Author) commented Sep 30, 2020

    try:
        trainer = Trainer.from_argparse_args(
            args,
            gpus=-1,
            precision=16,
            distributed_backend="ddp",
            callbacks=[lr_monitor],
            early_stop_callback=early_stopping,
            accumulate_grad_batches=1
            if not os.getenv("ACC_GRAD")
            else int(os.getenv("ACC_GRAD")),
            gradient_clip_val=0.0
            if not os.getenv("GRAD_CLIP")
            else float(os.getenv("GRAD_CLIP")),
            max_epochs=1000 if not os.getenv("EPOCHS") else int(os.getenv("EPOCHS")),
            default_root_dir=os.getcwd()
            if not os.getenv("DIR_ROOT_DIR")
            else os.getenv("DIR_ROOT_DIR"),
        )
        trainer.fit(model)
        trainer.test(model)
    except KeyboardInterrupt:
        torch.save(model.state_dict(), "INTERRUPTED.pth")
        logging.info("Saved interrupt")
        try:
            sys.exit(0)
        except SystemExit:
            os._exit(0)

@vegovs (Author) commented Sep 30, 2020

Is using DDP the issue? I used DDP in the one-GPU environment as well.

@awaelchli (Contributor) commented Sep 30, 2020

Yes, unfortunately it looks like it. Calling trainer.fit followed by trainer.test currently does not work with DDP, because DDP launches the training script in separate processes. We haven't found a good solution for this yet. With a single GPU it should work, though. One workaround is to switch to ddp_spawn, but it is not ideal: all your classes then need to be picklable.
Alternatively, you can move trainer.test to a separate script and run it independently.
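
In code, the two workarounds look roughly like this (MyModel and the checkpoint path are illustrative placeholders, not code from this issue):

    from pytorch_lightning import Trainer

    model = MyModel()  # your LightningModule (placeholder name)

    # Workaround 1: switch to ddp_spawn so fit() and test() run in the
    # same process. All of the model's attributes must be picklable.
    trainer = Trainer(gpus=-1, precision=16, distributed_backend="ddp_spawn")
    trainer.fit(model)
    trainer.test(model)

    # Workaround 2: keep ddp for training, then test from a separate
    # script that loads the trained weights:
    #     model = MyModel.load_from_checkpoint("path/to/checkpoint.ckpt")
    #     trainer = Trainer(gpus=1)
    #     trainer.test(model)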
