Densenet architectures providing non-deterministic results #5790

ankur56 · 2021-02-02T23:51:31Z

❓ Questions and Help

Before asking:

Try to find answers to your questions in the Lightning Forum!
Search for similar issues.
Search the docs.

I have tried looking for answers in other forums but couldn't find anything related to my question.

What is your question?

I can't obtain deterministic results with DenseNets (https://github.com/gpleiss/efficient_densenet_pytorch). With a simpler architecture, LitAutoEncoder, the results are deterministic. Could the difference be due to the large number of convolution layers in DenseNet models?

Code

The Densenet code I am using is as follows,

#!/usr/bin/env python3

import os
import time
import torch
from torchvision import datasets, transforms
import argparse
import json
import pprint
import copy
import sys
import shutil
import numpy as np
import math
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as cp
from collections import OrderedDict
from typing import List
from torch.utils.data import TensorDataset, DataLoader
from pytorch_lightning.callbacks import Callback, ModelCheckpoint
from pytorch_lightning import LightningModule, Trainer, seed_everything
import pytorch_lightning as pl

def _bn_function_factory(norm, relu, conv):
    def bn_function(*inputs):
        concated_features = torch.cat(inputs, 1)
        bottleneck_output = conv(relu(norm(concated_features)))
        return bottleneck_output

    return bn_function


class _DenseLayer(pl.LightningModule):
    def __init__(self,
                 num_input_features,
                 growth_rate,
                 bn_size,
                 drop_rate,
                 efficient=False):
        super(_DenseLayer, self).__init__()
        self.add_module('norm1', nn.BatchNorm2d(num_input_features))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module(
            'conv1',
            nn.Conv2d(num_input_features,
                      bn_size * growth_rate,
                      kernel_size=1,
                      stride=1,
                      bias=False))
        self.add_module('norm2', nn.BatchNorm2d(bn_size * growth_rate))
        self.add_module('relu2', nn.ReLU(inplace=True))
        self.add_module(
            'conv2',
            nn.Conv2d(bn_size * growth_rate,
                      growth_rate,
                      kernel_size=3,
                      stride=1,
                      padding=1,
                      bias=False))
        self.drop_rate = drop_rate
        self.efficient = efficient

    def forward(self, *prev_features):
        bn_function = _bn_function_factory(self.norm1, self.relu1, self.conv1)
        if self.efficient and any(prev_feature.requires_grad
                                  for prev_feature in prev_features):
            bottleneck_output = cp.checkpoint(bn_function, *prev_features)
        else:
            bottleneck_output = bn_function(*prev_features)
        new_features = self.conv2(self.relu2(self.norm2(bottleneck_output)))
        if self.drop_rate > 0:
            new_features = F.dropout(new_features,
                                     p=self.drop_rate,
                                     training=self.training)
        return new_features


class _Transition(nn.Sequential):
    def __init__(self, num_input_features, num_output_features):
        super(_Transition, self).__init__()
        self.add_module('norm', nn.BatchNorm2d(num_input_features))
        self.add_module('relu', nn.ReLU(inplace=True))
        self.add_module(
            'conv',
            nn.Conv2d(num_input_features,
                      num_output_features,
                      kernel_size=1,
                      stride=1,
                      bias=False))
        self.add_module('pool', nn.AvgPool2d(kernel_size=2, stride=2))


class _DenseBlock(pl.LightningModule):
    def __init__(self,
                 num_layers,
                 num_input_features,
                 bn_size,
                 growth_rate,
                 drop_rate,
                 efficient=False):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(
                num_input_features + i * growth_rate,
                growth_rate=growth_rate,
                bn_size=bn_size,
                drop_rate=drop_rate,
                efficient=efficient,
            )
            self.add_module('denselayer%d' % (i + 1), layer)

    def forward(self, init_features):
        features = [init_features]
        for name, layer in self.named_children():
            new_features = layer(*features)
            features.append(new_features)
        return torch.cat(features, 1)


class DenseNet(pl.LightningModule):
    r"""Densenet-BC model class, based on
    `"Densely Connected Convolutional Networks" <https://arxiv.org/pdf/1608.06993.pdf>`
    Args:
        growth_rate (int) - how many filters to add each layer (`k` in paper)
        block_config (list of 3 or 4 ints) - how many layers in each pooling block
        num_init_features (int) - the number of filters to learn in the first convolution layer
        bn_size (int) - multiplicative factor for number of bottle neck layers
            (i.e. bn_size * k features in the bottleneck layer)
        drop_rate (float) - dropout rate after each dense layer
        num_classes (int) - number of classification classes
        small_inputs (bool) - set to True if images are 32x32. Otherwise assumes images are larger.
        efficient (bool) - set to True to use checkpointing. Much more memory efficient, but slower.
    """
    def __init__(self,
                 growth_rate=12,
                 block_config=(16, 16, 16),
                 compression=0.5,
                 num_init_features=24,
                 bn_size=4,
                 drop_rate=0,
                 num_classes=10,
                 small_inputs=True,
                 efficient=False):

        super(DenseNet, self).__init__()
        assert 0 < compression <= 1, 'compression of densenet should be between 0 and 1'

        # First convolution
        if small_inputs:
            self.features = nn.Sequential(
                OrderedDict([
                    ('conv0',
                     nn.Conv2d(3,
                               num_init_features,
                               kernel_size=3,
                               stride=1,
                               padding=1,
                               bias=False)),
                ]))
        else:
            self.features = nn.Sequential(
                OrderedDict([
                    ('conv0',
                     nn.Conv2d(3,
                               num_init_features,
                               kernel_size=7,
                               stride=2,
                               padding=3,
                               bias=False)),
                ]))
            self.features.add_module('norm0',
                                     nn.BatchNorm2d(num_init_features))
            self.features.add_module('relu0', nn.ReLU(inplace=True))
            self.features.add_module(
                'pool0',
                nn.MaxPool2d(kernel_size=3,
                             stride=2,
                             padding=1,
                             ceil_mode=False))

        # Each denseblock
        num_features = num_init_features
        for i, num_layers in enumerate(block_config):
            block = _DenseBlock(
                num_layers=num_layers,
                num_input_features=num_features,
                bn_size=bn_size,
                growth_rate=growth_rate,
                drop_rate=drop_rate,
                efficient=efficient,
            )
            self.features.add_module('denseblock%d' % (i + 1), block)
            num_features = num_features + num_layers * growth_rate
            if i != len(block_config) - 1:
                trans = _Transition(num_input_features=num_features,
                                    num_output_features=int(num_features *
                                                            compression))
                self.features.add_module('transition%d' % (i + 1), trans)
                num_features = int(num_features * compression)

        # Final batch norm
        self.features.add_module('norm_final', nn.BatchNorm2d(num_features))

        # Linear layer
        self.classifier = nn.Linear(num_features, num_classes)

        # Initialization
        for name, param in self.named_parameters():
            if 'conv' in name and 'weight' in name:
                n = param.size(0) * param.size(2) * param.size(3)
                param.data.normal_().mul_(math.sqrt(2. / n))
            elif 'norm' in name and 'weight' in name:
                param.data.fill_(1)
            elif 'norm' in name and 'bias' in name:
                param.data.fill_(0)
            elif 'classifier' in name and 'bias' in name:
                param.data.fill_(0)

    def forward(self, x):
        features = self.features(x)
        out = F.relu(features, inplace=True)
        out = F.adaptive_avg_pool2d(out, (1, 1))
        out = torch.flatten(out, 1)
        out = self.classifier(out)
        return out

    def ce_loss(self, logits, labels):
        return F.cross_entropy(logits, labels)

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = self.ce_loss(logits, y)
        self.log('train_loss',
                 loss,
                 sync_dist=True,
                 on_epoch=True,
                 on_step=True)
        return {'loss': loss}

    def training_epoch_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        self.log('avg_train_loss', avg_loss, on_epoch=True, sync_dist=True)

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        logits = self.forward(x)
        loss = self.ce_loss(logits, y)
        self.log('val_loss', loss, on_step=True, on_epoch=True, sync_dist=True)
        return {'rval_loss': loss}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['rval_loss'] for x in outputs]).mean()
        self.log('avg_val_loss', avg_loss, on_epoch=True, sync_dist=True)

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(),
                                    lr=0.1,
                                    momentum=0.9,
                                    nesterov=True,
                                    weight_decay=1e-4)

        lr_scheduler = {
            'scheduler':
            torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                       factor=0.75,
                                                       patience=5,
                                                       threshold=5e-3,
                                                       threshold_mode='abs',
                                                       cooldown=0,
                                                       min_lr=1e-6,
                                                       verbose=True),
            'name':
            'red_pl_lr',
            'monitor':
            'train_loss_epoch'
        }

        return [optimizer], [lr_scheduler]


class DataModule(pl.LightningDataModule):
    def __init__(self, batch_size=16):
        super().__init__()
        self.batch_size = batch_size

    def setup(self, stage=None):
        valid_size = 5000
        data = "/projects/data/"
        mean = [0.5071, 0.4867, 0.4408]
        stdv = [0.2675, 0.2565, 0.2761]
        train_transforms = transforms.Compose([
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize(mean=mean, std=stdv),
        ])
        test_transforms = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=mean, std=stdv),
        ])
        # Datasets
        self.train_set = datasets.CIFAR10(data,
                                          train=True,
                                          transform=train_transforms,
                                          download=True)
        self.test_set = datasets.CIFAR10(data,
                                         train=False,
                                         transform=test_transforms,
                                         download=False)

        if valid_size:
            valid_set = datasets.CIFAR10(data,
                                         train=True,
                                         transform=test_transforms)
            indices = torch.randperm(len(self.train_set))
            train_indices = indices[:len(indices) - valid_size]
            valid_indices = indices[len(indices) - valid_size:]
            self.train_set = torch.utils.data.Subset(self.train_set, train_indices)
            self.valid_set = torch.utils.data.Subset(valid_set, valid_indices)
        else:
            self.valid_set = None

    def train_dataloader(self):
        return DataLoader(self.train_set,
                          batch_size=self.batch_size,
                          num_workers=0,
                          pin_memory=True)

    def val_dataloader(self):
        return DataLoader(self.valid_set,
                          batch_size=self.batch_size,
                          num_workers=0,
                          pin_memory=True)


class my_callbacks(Callback):
    def __init__(self) -> None:
        self.metrics: List = []

    def on_epoch_end(self, trainer: Trainer,
                     pl_module: LightningModule) -> None:
        metrics_dict = copy.copy(trainer.callback_metrics)
        new_metrics_dict = {k: v.item() for k, v in metrics_dict.items()}
        pl_module.print(json.dumps(new_metrics_dict, indent=4, sort_keys=True),
                        flush=True)

seed_everything(22)
p_callback = my_callbacks()
data_module = DataModule()

model = DenseNet(
    growth_rate=12,
    block_config=(16, 16),
    num_init_features=64,
    num_classes=10,
    small_inputs=True,
    efficient=False,
)

trainer = pl.Trainer(
    gpus=-1,
    accelerator='ddp',
    benchmark=True,
    callbacks=[p_callback],
    max_epochs=2,
    deterministic=True,
    progress_bar_refresh_rate=0)

trainer.fit(model, data_module)

The code I am using for LitAutoEncoder is as follows,

#!/usr/bin/env python3

import os
import json
import time
import copy
from argparse import ArgumentParser
from typing import List
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl
from pytorch_lightning.metrics.functional import accuracy
from pytorch_lightning.callbacks import Callback
from pytorch_lightning import LightningModule, Trainer

pl.seed_everything(22)
batch_size = 32

dataset = MNIST(os.getcwd(),
                train=True,
                download=True,
                transform=transforms.ToTensor())
mnist_test = MNIST(os.getcwd(),
                   train=False,
                   download=True,
                   transform=transforms.ToTensor())
mnist_train, mnist_val = random_split(dataset, [55000, 5000])

train_loader = DataLoader(mnist_train, batch_size=batch_size, num_workers=4, pin_memory=True)
val_loader = DataLoader(mnist_val, batch_size=batch_size, num_workers=4, pin_memory=True)
test_loader = DataLoader(mnist_test, batch_size=batch_size, num_workers=4, pin_memory=True)


class LitAutoEncoder(pl.LightningModule):
    def __init__(self, batch_size=32, lr=1e-3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(),
                                     nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                     nn.Linear(64, 28 * 28))
        self.batch_size = batch_size
        self.learning_rate = lr

    def forward(self, x):
        # in lightning, forward defines the prediction/inference actions
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('train_loss', loss, on_step=True, on_epoch=True, sync_dist=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('val_loss', loss, on_step=True, on_epoch=True, sync_dist=True)

    def test_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log('test_loss', loss, on_step=True, on_epoch=True, sync_dist=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

class my_callbacks(Callback):
    def __init__(self) -> None:
        self.metrics: List = []

    def on_epoch_end(self, trainer: Trainer,
                     pl_module: LightningModule) -> None:
        metrics_dict = copy.copy(trainer.callback_metrics)
        new_metrics_dict = {k: v.item() for k, v in metrics_dict.items()}
        pl_module.print(json.dumps(new_metrics_dict, indent=4, sort_keys=True),
                        flush=True)

model = LitAutoEncoder()
p_callback = my_callbacks()

trainer = pl.Trainer(progress_bar_refresh_rate=0,
                     max_epochs=2,
                     gpus=-1,
                     callbacks=[p_callback],
                     benchmark=True,
                     accelerator='ddp',
                     deterministic=True)

trainer.fit(model, train_loader, val_loader)

What have you tried?

I am running all of my jobs on a supercomputer. I have tried running the code multiple times on the same node to remove any randomness due to having a different machine, but apparently, that doesn't make any difference.
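
One way to check whether two runs actually diverge is to compare the JSON that the callback above prints at the end of each epoch. The following is a minimal sketch, assuming each run's printed metrics have been captured as a JSON string; `compare_runs` and the metric values are hypothetical and used only for illustration.

```python
import json
import math


def compare_runs(run_a: str, run_b: str, tol: float = 0.0) -> dict:
    """Return the metrics whose values differ by more than `tol` between two runs."""
    a = json.loads(run_a)
    b = json.loads(run_b)
    diffs = {}
    for key in sorted(set(a) | set(b)):
        if key not in a or key not in b:
            # Metric logged in only one of the two runs.
            diffs[key] = (a.get(key), b.get(key))
        elif not math.isclose(a[key], b[key], rel_tol=0.0, abs_tol=tol):
            diffs[key] = (a[key], b[key])
    return diffs


# Hypothetical metric values, not taken from a real run.
run_1 = '{"train_loss_epoch": 1.2345678, "val_loss_epoch": 1.1}'
run_2 = '{"train_loss_epoch": 1.2345679, "val_loss_epoch": 1.1}'

print(compare_runs(run_1, run_2))            # exact comparison
print(compare_runs(run_1, run_2, tol=1e-5))  # tolerant comparison
```

With `tol=0.0` the comparison is exact, which is the relevant check for bitwise determinism; a small tolerance only indicates whether the runs are numerically close.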

What's your environment?

OS: Linux
Packaging: pip
Version: Pytorch-Lightning 1.2.0rc0

akihironitta · 2021-02-03T00:23:41Z

@ankur56 I ran your DenseNet code multiple times locally, and every run gave the same results in my environment. Can you reproduce the behaviour with BoringModel?

my env

* CUDA:
	- GPU:
	- available:         False
	- version:           None
* Packages:
	- numpy:             1.19.5
	- pyTorch_debug:     False
	- pyTorch_version:   1.7.1+cpu
	- pytorch-lightning: 1.2.0rc0
	- tqdm:              4.49.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		- ELF
	- processor:         
	- python:            3.8.5
	- version:           #1 SMP Debian 4.19.160-2 (2020-11-28)

ankur56 · 2021-02-03T00:41:52Z

@akihironitta Thank you for trying out my code. Please correct me if I am wrong, but from your environment, it doesn't seem like you are using GPUs with DDP. Are you running it only on CPUs? As far as I know, non-determinism mostly arises while using GPUs. As I mentioned in my question, I can obtain deterministic behavior with the simpler model (LitAutoEncoder) I posted, so it's highly likely that I will be able to obtain determinism with the Boring Model as well. I will give that one a try. But if you have access to a multi-GPU node, please run the Densenet model on it, and let me know what you get.
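
A common reason GPU runs can differ where CPU runs do not is that parallel reductions may accumulate values in a different order from run to run, and floating-point addition is not associative. The sketch below is a generic illustration of that effect in plain Python; it is not a diagnosis of the DenseNet case, and `ordered_sum` is a hypothetical helper.

```python
def ordered_sum(values):
    """Sum floats strictly left to right, so the order of `values` matters."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same three numbers, accumulated in two different orders.
a = ordered_sum([1e16, 1.0, -1e16])  # (1e16 + 1.0) - 1e16: the 1.0 is lost to rounding
b = ordered_sum([1e16, -1e16, 1.0])  # (1e16 - 1e16) + 1.0: the 1.0 survives

print(a, b, a == b)
```

Differences of this kind start out tiny, but over many training steps they can compound into visibly different loss values.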

akihironitta · 2021-02-03T00:53:40Z

@ankur56 You're right, I only ran it on CPUs, where the results were deterministic, so the problem likely arises only when running on accelerators, as you mentioned. Unfortunately, I currently do not have access to any other computing resources, so I cannot investigate this further at the moment, but I'm sure other core members will. Sorry for the inconvenience.
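
Both scripts in the question pass `benchmark=True` together with `deterministic=True`. According to the PyTorch reproducibility notes, cuDNN benchmarking can select a different convolution algorithm on each run, so this combination may be worth ruling out. Whether it explains the DenseNet behaviour is not established in this thread. A minimal sketch of the corresponding PyTorch-level settings, assuming a plain PyTorch setup (`configure_determinism` is a hypothetical helper name):

```python
import os

# The PyTorch reproducibility notes mention this variable for some cuBLAS
# operations on CUDA >= 10.2; it has to be set before CUDA work starts.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch


def configure_determinism(seed: int = 22) -> None:
    """Seed PyTorch and ask cuDNN for deterministic, non-benchmarked algorithms."""
    torch.manual_seed(seed)                    # seeds the CPU and CUDA generators
    torch.backends.cudnn.benchmark = False     # do not auto-tune algorithm choice per run
    torch.backends.cudnn.deterministic = True  # restrict cuDNN to deterministic algorithms


configure_determinism()
print(torch.backends.cudnn.benchmark, torch.backends.cudnn.deterministic)
```

In Lightning, the analogous change would be passing `benchmark=False` to the `Trainer` while keeping `deterministic=True`.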
