Labels: distributed, feature, help wanted, priority: 1
Description
🐛 Bug
Returning None from training_step with multi-GPU DDP training freezes training without an exception
To Reproduce
Start multi-GPU DDP training with a training_step that returns None for some batches.
Example training_step function:
import random
import torch

# training_step of the LightningModule
def training_step(self, batch, batch_idx):
    data, target = batch
    model_outputs = self.forward(data)
    loss = calc_loss(model_outputs, target)
    # Skip the batch by returning None when the loss is NaN
    # (or randomly, to make the freeze easy to reproduce)
    if torch.isnan(loss) or random.random() < 0.05:
        return None
    return loss
Example trainer:
from pytorch_lightning import Trainer

trainer = Trainer(
    gpus=2,
    distributed_backend="ddp",
)
Expected behavior
Training should continue, skipping the current batch, as pointed out here.
Environment
No specific environment is needed to reproduce this bug.
Additional context
This issue was mentioned in #4956, but without specifics.
Note: while this issue is being investigated, help with a workaround would be greatly appreciated!
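For anyone hitting this in the meantime, here is a minimal sketch of one possible workaround (not an official Lightning API, just an assumption that a zero-valued surrogate loss is acceptable): instead of returning None, return a zero loss that still depends on every model parameter, so the skipping rank still runs a matching backward pass and the DDP gradient all-reduce does not deadlock.

def training_step(self, batch, batch_idx):
    data, target = batch
    model_outputs = self.forward(data)
    loss = calc_loss(model_outputs, target)
    if torch.isnan(loss) or random.random() < 0.05:
        # Instead of returning None, return a zero-valued loss that is
        # still connected to every parameter. All gradients are zero,
        # but this rank still performs backward and joins the DDP
        # all-reduce, so the other ranks are not left waiting.
        return sum((p * 0.0).sum() for p in self.parameters())
    return loss

Note that optimizers with momentum or weight decay may still nudge the weights slightly on these zero-gradient steps, so this is only a stopgap until None is handled properly under DDP.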