Bind a LightningModule to more than 1 GPU.
#1486
Replies: 5 comments
-
Update: I'm checking whether the following method may solve the problem.
-
With the following code:

```python
"""
Multi-node example (GPU)
"""
import os
from argparse import ArgumentParser

import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import BertModel, DistilBertForSequenceClassification, AdamW

import pytorch_lightning as pl


def flatten(list_of_lists):
    return [item for sublist in list_of_lists for item in sublist]


SEED = 2334
torch.manual_seed(SEED)
np.random.seed(SEED)


class CustomModel(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.b1 = DistilBertForSequenceClassification.from_pretrained(
            "distilbert-base-cased", cache_dir="../cache/models")
        self.b2 = DistilBertForSequenceClassification.from_pretrained(
            "distilbert-base-cased", cache_dir="../cache/models")

    def forward(self, inputs):
        # model outputs are always tuples in transformers (see docs)
        # pin each sub-model to its own GPU and copy the inputs accordingly
        self.b1.cuda(0)
        self.b2.cuda(1)
        inputs_teacher = {k: v.cuda(0) for k, v in inputs.items()}
        inputs_student = {k: v.cuda(1) for k, v in inputs.items()}
        teacher_out = self.b1(**inputs_teacher)[0]
        student_out = self.b2(**inputs_student)[0]
        # gather both outputs on GPU 0 before computing the loss
        return teacher_out, student_out.cuda(0)

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, token_type_ids = batch[:, 0, :], batch[:, 1, :], batch[:, 2, :]
        inputs = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
        }
        res_a, res_b = self.forward(inputs)
        # loss as difference between predictions
        loss = (res_a - res_b).mean()
        return {'loss': loss}

    def train_dataloader(self):
        # create a fake dataset
        loader = torch.tensor([
            [[0] * 128, [0] * 128, [0] * 128]
        ] * 10000, dtype=torch.int64)
        return DataLoader(loader, batch_size=8)

    def configure_optimizers(self):
        models = [self.b1, self.b2]
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": flatten([[p for n, p in model.named_parameters()
                                    if not any(nd in n for nd in no_decay)]
                                   for model in models]),
                "weight_decay": 0.0,
            },
            {
                "params": flatten([[p for n, p in model.named_parameters()
                                    if any(nd in n for nd in no_decay)]
                                   for model in models]),
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(optimizer_grouped_parameters, lr=0.0005)
        return optimizer


def main(hparams):
    """Main training routine specific for this project."""
    # ------------------------
    # 1 INIT LIGHTNING MODEL
    # ------------------------
    model = CustomModel()

    # ------------------------
    # 2 INIT TRAINER
    # ------------------------
    trainer = pl.Trainer(
        max_epochs=2,
        gpus=[0, 1]
    )

    # ------------------------
    # 3 START TRAINING
    # ------------------------
    trainer.fit(model)


if __name__ == '__main__':
    # ------------------------
    # TRAINING ARGUMENTS
    # ------------------------
    # these are project-wide arguments
    root_dir = os.path.dirname(os.path.realpath(__file__))
    parent_parser = ArgumentParser(add_help=False)

    # gpu args
    parent_parser.add_argument(
        '--gpus',
        type=int,
        default=2,
        help='how many gpus'
    )
    parent_parser.add_argument(
        '--distributed_backend',
        type=str,
        default='dp',
        help='supports three options dp, ddp, ddp2'
    )
    parent_parser.add_argument(
        '--use_16bit',
        dest='use_16bit',
        action='store_true',
        help='if true uses 16 bit precision'
    )

    # each LightningModule defines arguments relevant to it
    hyperparams = parent_parser.parse_args()

    # ---------------------
    # RUN TRAINING
    # ---------------------
    main(hyperparams)
```

I receive the error
-
After some debugging, I found that Lightning is trying to apply DataParallel to both models in any case. In fact, if I print…
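For context, here is a small, generic PyTorch check (not a Lightning API) that can be pointed at any module instance you can get hold of, e.g. `self` from inside `training_step`; it simply reports which submodules, if any, have been wrapped in a `DataParallel`-style container:

```python
import torch.nn as nn


def report_parallel_wrappers(module: nn.Module) -> None:
    """Print every submodule that is a DataParallel / DistributedDataParallel wrapper."""
    found = False
    for name, sub in module.named_modules():
        if isinstance(sub, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
            print(f"{name or '<root>'} is wrapped in {type(sub).__name__}")
            found = True
    if not found:
        print("no DataParallel wrappers found")
```

Whether the wrapper shows up on the reference you hold depends on the Lightning version, since the trainer may only keep the wrapped copy internally.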
-
Ummm, any update on this?
-
I think specifying…
So there's no need to use DataParallel or have Lightning use DataParallel for you.
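As a rough sketch of that suggestion (assumptions: at least two visible GPUs, and a Trainer that is not asked to manage devices at all, so it neither wraps nor moves the module; exact behaviour varies across Lightning versions), the module can pin its sub-models to fixed devices itself:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader


class TwoDeviceModule(pl.LightningModule):
    """Hypothetical module that keeps each sub-model on its own GPU."""

    def __init__(self):
        super().__init__()
        self.net_a = nn.Linear(128, 128).cuda(0)   # sub-model 1 lives on GPU 0
        self.net_b = nn.Linear(128, 128).cuda(1)   # sub-model 2 lives on GPU 1

    def forward(self, x):
        out_a = self.net_a(x.cuda(0))
        out_b = self.net_b(x.cuda(1))
        return out_a, out_b.cuda(0)                # gather outputs on one device

    def training_step(self, batch, batch_idx):
        out_a, out_b = self(batch)
        return {"loss": (out_a - out_b).mean()}

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=5e-4)

    def train_dataloader(self):
        return DataLoader(torch.zeros(10000, 128), batch_size=8)


# No `gpus` argument: Lightning runs a single process and leaves the module's
# manual device placement alone, so no DataParallel wrapping is involved.
trainer = pl.Trainer(max_epochs=2)
trainer.fit(TwoDeviceModule())
```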
-
Use case
I created a `LightningModule` that contains two models, and each one should be put on a dedicated GPU. So, even while not doing distributed training, the minimum GPU count to run this experiment should be 2 (or 0). This is necessary because I have to train two interacting BERT models together and they do not fit on a single GPU. A similar application would be a GAN with a big generator and a big discriminator.
This could be solved by allowing a model to require `X` GPUs and then dividing the number of GPUs on a machine by `X`, receiving back the number of GPU "clusters". Is there a way to do that, or is this functionality going to be implemented?
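To make the proposal concrete, here is a tiny sketch of what such a grouping could look like; `gpu_clusters` is a hypothetical helper, not an existing Lightning API:

```python
import torch


def gpu_clusters(gpus_per_model: int):
    """Hypothetical helper: split the visible GPUs into groups of `gpus_per_model`."""
    n_gpus = torch.cuda.device_count()
    return [
        list(range(start, start + gpus_per_model))
        for start in range(0, n_gpus - gpus_per_model + 1, gpus_per_model)
    ]


# With 4 visible GPUs and gpus_per_model=2 this yields [[0, 1], [2, 3]]:
# each "cluster" could then host one copy of the 2-GPU model.
print(gpu_clusters(2))
```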
What have you tried?
Tried to run an experiment with a `LightningModule` containing two BERT models on a machine with two GPUs. The models are not assigned one per GPU, so training is really slow. Moreover, the batch size is very small because each model is replicated on every GPU and memory is basically full before loading training data.
What's your environment?