Error in returning Dict from training_step with multiple GPUs #6193
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), good first issue (Good for newcomers), help wanted (Open to be worked on), priority: 0 (High priority task)
🐛 Bug
When using multiple GPUs with 'dp', the error

RuntimeError: grad can be implicitly created only for scalar outputs

occurs if I use a training_step function that returns a Dict.

Please reproduce using the BoringModel:
https://colab.research.google.com/drive/1hmHqYHPOqDlZUAF7-9zcCvobrvSPt7W5?usp=sharing
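The underlying failure can be shown outside Lightning with plain PyTorch (a minimal sketch, assuming torch is installed): under 'dp', the loss values returned from each GPU are gathered into a single non-scalar tensor, and calling backward() on a non-scalar tensor without an explicit gradient argument raises exactly this error.

```python
import torch

# Stand-in for the per-device losses that 'dp' gathers from each GPU's
# training_step: a tensor with one element per device, not a scalar.
x = torch.ones(2, requires_grad=True)
gathered_loss = x * 2  # shape (2,) — non-scalar

err_msg = ""
try:
    # backward() on a non-scalar tensor without a `gradient` argument fails
    gathered_loss.backward()
except RuntimeError as e:
    err_msg = str(e)

print(err_msg)
```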
Expected behavior

Returning a Dict with a loss key from training_step is supposed to work.

A quick solution

Return the loss tensor directly from the training_step function.
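In plain PyTorch terms (a minimal sketch, assuming the same gathered-loss shape as in the bug description), the reason the workaround succeeds is that the loss must be reduced to a scalar before backward() is called:

```python
import torch

# Stand-in for per-device losses gathered under 'dp' (assumed shape: one
# element per GPU).
x = torch.ones(2, requires_grad=True)
gathered_loss = x * 2            # shape (2,) — backward() would fail here

scalar_loss = gathered_loss.mean()  # reduce to a scalar first
scalar_loss.backward()              # succeeds: scalar output

print(x.grad)  # gradient of mean(2 * x) w.r.t. x
```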
Environment
cc. @carmocca