
Returning None from training_step with multi GPU DDP training #5243


@iamkucuk

🐛 Bug

Returning None from training_step during multi-GPU DDP training freezes the run without raising an exception.

To Reproduce

Start multi-GPU DDP training with a training_step function that sometimes returns None.

Example training_step function:

    def training_step(self, batch, batch_idx):
        data, target = batch
        model_outputs = self.forward(data)
        loss = calc_loss(model_outputs, target)

        # Skip the batch when the loss is NaN (and randomly, to make the hang easy to hit).
        if torch.isnan(loss) or random.random() < .05:
            return None

        return loss

Example trainer:

    trainer = Trainer(
        gpus=2,
        distributed_backend="ddp",
    )

Expected behavior

Training should continue, skipping the current batch, as pointed out here.

Environment

No specific environment is needed to reproduce this bug.

Additional context

This issue was mentioned in #4956, but without specifics.

Note: While this issue is being investigated, help with a workaround would be greatly appreciated!
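
Until this is handled inside Lightning, one possible workaround is sketched below. This is only a sketch based on the reproduction above: `calc_loss` is the helper from that snippet, and `torch.distributed` is assumed to already be initialized by the DDP backend. The idea is to make the skip decision identical on every rank and to return a zero-valued loss instead of None, so no rank ever drops out of the gradient all-reduce:

    import random

    import torch
    import torch.distributed as dist

    def training_step(self, batch, batch_idx):
        data, target = batch
        model_outputs = self.forward(data)
        loss = calc_loss(model_outputs, target)

        # Each rank decides locally whether it wants to skip this batch.
        should_skip = bool(torch.isnan(loss)) or random.random() < .05
        skip_flag = torch.tensor(float(should_skip), device=loss.device)

        # Agree on a single decision across all ranks: if any rank wants to skip,
        # every rank skips. Without this, ranks diverge at the gradient
        # synchronization and the job hangs.
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(skip_flag, op=dist.ReduceOp.MAX)

        if skip_flag.item() > 0:
            # Return a zero loss that still touches every parameter, so DDP's
            # reducer receives (zero) gradients on every rank and stays in sync.
            return sum(p.sum() for p in self.parameters()) * 0.0

        return loss

Note that the optimizer still takes a step with zero gradients on skipped batches, which is not a strict no-op for optimizers with momentum or weight decay; whether that is acceptable depends on the use case.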


Labels

distributed (Generic distributed-related topic), feature (Is an improvement or enhancement), help wanted (Open to be worked on), priority: 1 (Medium priority task)
