How to save image artefacts in a multi GPU training #5729
laughingrice asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
-
I think I figured out the right way to do this; adding the solution in case others run into the same issue. The trick was to use a callback instead of doing it from the training function (that way there is also the external guarantee that an appropriate logger was attached), which is cleaner. The main drawback is that it requires another forward pass through the network, but doing that once per epoch, or every few epochs, is not much of an overhead.
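The original post does not include the callback itself, so here is a minimal sketch of that approach, assuming an image-to-image model (e.g. an autoencoder), a `pytorch_lightning.loggers.MLFlowLogger` attached to the trainer, and a fixed `sample_batch` supplied at construction; the class name, `every_n_epochs`, and the `"images"` artifact path are illustrative choices, and the hook signature matches recent Lightning versions:

```python
import os
import tempfile

import torch
import torchvision
import pytorch_lightning as pl


class ImageLoggingCallback(pl.Callback):
    """Log input / output / error images every few epochs, from rank zero only."""

    def __init__(self, sample_batch: torch.Tensor, every_n_epochs: int = 1):
        self.sample_batch = sample_batch      # fixed batch, so images stay comparable across epochs
        self.every_n_epochs = every_n_epochs

    def on_train_epoch_end(self, trainer, pl_module):
        # Only the global rank-zero process logs, so the artifacts are written
        # exactly once and all come from the same, complete batch.
        if not trainer.is_global_zero:
            return
        if trainer.current_epoch % self.every_n_epochs != 0:
            return

        x = self.sample_batch.to(pl_module.device)
        with torch.no_grad():
            y = pl_module(x)                  # the extra forward pass, once per logging epoch

        grids = {
            "input": torchvision.utils.make_grid(x),
            "output": torchvision.utils.make_grid(y),
            "error": torchvision.utils.make_grid((x - y).abs()),
        }

        logger = trainer.logger               # assumed to be an MLFlowLogger
        with tempfile.TemporaryDirectory() as tmp:
            for name, grid in grids.items():
                path = os.path.join(tmp, f"{name}_epoch{trainer.current_epoch:04d}.png")
                torchvision.utils.save_image(grid, path)
                # MlflowClient.log_artifact uploads a local file into the run's artifact store
                logger.experiment.log_artifact(logger.run_id, path, artifact_path="images")
```

The callback then just gets passed to the trainer, e.g. `Trainer(logger=MLFlowLogger(...), callbacks=[ImageLoggingCallback(fixed_batch)])`, so the training step itself stays free of logging code.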
-
I am using the mlflow logger and am looking for the best way to save image artifacts in a multi-GPU setting.
Ideally I would like to log on the first batch of the following epoch, but for a start I would be content with logging the first batch of every epoch.
Currently I have a call that saves on the first batch of the epoch.
My problem is that the work is split across (at least) 4 workers, and each one logs its results, with only one of them prevailing. Leaving aside the wasted work, I am actually saving several images (input, output, error), and each one ends up coming from a different part of the batch, so they cannot be compared.
Is there a way to check whether I am on the first sub-batch of the batch?
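For the literal question (how to tell whether the current process is the one that should log), Lightning exposes the process's global rank on the module itself, so a guard inside `training_step` can do it. A minimal sketch with a toy image-to-image model; `LitAutoencoder` and the stashed `first_batch_images` dict are illustrative, not from the original post, and the callback in the reply above is the cleaner way to actually write the artifacts:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitAutoencoder(pl.LightningModule):
    """Toy image-to-image module; only the logging guard matters here."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, _ = batch
        y = self(x)
        # Under DDP each process sees its own shard of the data, so
        # `batch_idx == 0` alone fires on every rank. Adding the global-rank
        # check means only one process keeps input / output / error, and all
        # three come from the same shard.
        if batch_idx == 0 and self.global_rank == 0:
            self.first_batch_images = {
                "input": x.detach().cpu(),
                "output": y.detach().cpu(),
                "error": (x - y).abs().detach().cpu(),
            }
        return F.mse_loss(y, x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```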