Bind a LightningModule to more than 1 GPU.
#1486
Replies: 5 comments
-
Update: I'm checking whether the following method may solve the problem.
-
With the following code:

```python
"""
Multi-node example (GPU)
"""
import os
from argparse import ArgumentParser

import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import BertModel, DistilBertForSequenceClassification, AdamW

import pytorch_lightning as pl


def flatten(list_of_lists):
    return [item for sublist in list_of_lists for item in sublist]


SEED = 2334
torch.manual_seed(SEED)
np.random.seed(SEED)


class CustomModel(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.b1 = DistilBertForSequenceClassification.from_pretrained(
            "distilbert-base-cased", cache_dir="../cache/models")
        self.b2 = DistilBertForSequenceClassification.from_pretrained(
            "distilbert-base-cased", cache_dir="../cache/models")

    def forward(self, inputs):
        # model outputs are always tuples in transformers (see docs)
        # pin each sub-model to its own GPU and copy the inputs accordingly
        self.b1.cuda(0)
        self.b2.cuda(1)
        inputs_teacher = {k: v.cuda(0) for k, v in inputs.items()}
        inputs_student = {k: v.cuda(1) for k, v in inputs.items()}
        teacher_out = self.b1(**inputs_teacher)[0]
        student_out = self.b2(**inputs_student)[0]
        # gather both outputs on GPU 0 before computing the loss
        return teacher_out, student_out.cuda(0)

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, token_type_ids = batch[:, 0, :], batch[:, 1, :], batch[:, 2, :]
        inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
        res_a, res_b = self.forward(inputs)
        # loss as difference between predictions
        loss = (res_a - res_b).mean()
        return {'loss': loss}

    def train_dataloader(self):
        # create a fake dataset
        loader = torch.tensor([
            [[0] * 128, [0] * 128, [0] * 128]
        ] * 10000, dtype=torch.int64)
        return DataLoader(loader, batch_size=8)

    def configure_optimizers(self):
        models = [self.b1, self.b2]
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": flatten([[p for n, p in model.named_parameters()
                                    if not any(nd in n for nd in no_decay)]
                                   for model in models]),
                "weight_decay": 0.0,
            },
            {
                "params": flatten([[p for n, p in model.named_parameters()
                                    if any(nd in n for nd in no_decay)]
                                   for model in models]),
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters, lr=0.0005)
        return optimizer


def main(hparams):
    """Main training routine specific for this project."""
    # ------------------------
    # 1 INIT LIGHTNING MODEL
    # ------------------------
    model = CustomModel()

    # ------------------------
    # 2 INIT TRAINER
    # ------------------------
    trainer = pl.Trainer(
        max_epochs=2,
        gpus=[0, 1]
    )

    # ------------------------
    # 3 START TRAINING
    # ------------------------
    trainer.fit(model)


if __name__ == '__main__':
    # ------------------------
    # TRAINING ARGUMENTS
    # ------------------------
    # these are project-wide arguments
    root_dir = os.path.dirname(os.path.realpath(__file__))
    parent_parser = ArgumentParser(add_help=False)

    # gpu args
    parent_parser.add_argument(
        '--gpus',
        type=int,
        default=2,
        help='how many gpus'
    )
    parent_parser.add_argument(
        '--distributed_backend',
        type=str,
        default='dp',
        help='supports three options dp, ddp, ddp2'
    )
    parent_parser.add_argument(
        '--use_16bit',
        dest='use_16bit',
        action='store_true',
        help='if true uses 16 bit precision'
    )

    # each LightningModule defines arguments relevant to it
    hyperparams = parent_parser.parse_args()

    # ---------------------
    # RUN TRAINING
    # ---------------------
    main(hyperparams)
```

I receive the error
-
After some debugging, I found that Lightning is trying to apply DataParallel to both models in any case. In fact, if I print…
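For context, here is a small, generic PyTorch check (not a Lightning API) that can be pointed at any module instance you can get hold of, e.g. `self` from inside `training_step`; it simply reports which submodules, if any, have been wrapped in a `DataParallel`-style container:

```python
import torch.nn as nn


def report_parallel_wrappers(module: nn.Module) -> None:
    """Print every submodule that is a DataParallel / DistributedDataParallel wrapper."""
    found = False
    for name, sub in module.named_modules():
        if isinstance(sub, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
            print(f"{name or '<root>'} is wrapped in {type(sub).__name__}")
            found = True
    if not found:
        print("no DataParallel wrappers found")
```

Whether the wrapper shows up on the reference you hold depends on the Lightning version, since the trainer may only keep the wrapped copy internally.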
-
Ummm, any update on this?
-
I think specifying…
So there's no need to use DataParallel or have Lightning use DataParallel for you.
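As a rough sketch of that suggestion (assumptions: at least two visible GPUs, and a Trainer that is not asked to manage devices at all, so it neither wraps nor moves the module; exact behaviour varies across Lightning versions), the module can pin its sub-models to fixed devices itself:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader


class TwoDeviceModule(pl.LightningModule):
    """Hypothetical module that keeps each sub-model on its own GPU."""

    def __init__(self):
        super().__init__()
        self.net_a = nn.Linear(128, 128).cuda(0)   # sub-model 1 lives on GPU 0
        self.net_b = nn.Linear(128, 128).cuda(1)   # sub-model 2 lives on GPU 1

    def forward(self, x):
        out_a = self.net_a(x.cuda(0))
        out_b = self.net_b(x.cuda(1))
        return out_a, out_b.cuda(0)                # gather outputs on one device

    def training_step(self, batch, batch_idx):
        out_a, out_b = self(batch)
        return {"loss": (out_a - out_b).mean()}

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=5e-4)

    def train_dataloader(self):
        return DataLoader(torch.zeros(10000, 128), batch_size=8)


# No `gpus` argument: Lightning runs a single process and leaves the module's
# manual device placement alone, so no DataParallel wrapping is involved.
trainer = pl.Trainer(max_epochs=2)
trainer.fit(TwoDeviceModule())
```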
-
Use case
I created a `LightningModule` that contains two models, and each one should be put on a dedicated GPU. So, even while not doing distributed training, the minimum GPU count to run this experiment should be 2 (or 0). This is necessary because I have to train two interacting BERT models together and they do not fit on a single GPU. A similar application would be a GAN with a big generator and a big discriminator.
This could be solved by allowing a model to require `X` GPUs and then dividing the number of GPUs on a machine by `X`, receiving back the number of GPU "clusters". Is there a way to do that, or is this functionality going to be implemented?
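To make the proposal concrete, here is a tiny sketch of what such a grouping could look like; `gpu_clusters` is a hypothetical helper, not an existing Lightning API:

```python
import torch


def gpu_clusters(gpus_per_model: int):
    """Hypothetical helper: split the visible GPUs into groups of `gpus_per_model`."""
    n_gpus = torch.cuda.device_count()
    return [
        list(range(start, start + gpus_per_model))
        for start in range(0, n_gpus - gpus_per_model + 1, gpus_per_model)
    ]


# With 4 visible GPUs and gpus_per_model=2 this yields [[0, 1], [2, 3]]:
# each "cluster" could then host one copy of the 2-GPU model.
print(gpu_clusters(2))
```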
What have you tried?
Tried to run an experiment with a `LightningModule` containing two BERT models on a machine with two GPUs. The models are not assigned one per GPU, so training is really slow. Moreover, the batch size is very small because each model is replicated on every GPU and memory is basically full before loading training data.
What's your environment?