Tensors must be CUDA and dense on using DDP #2529

hadarishav · 2020-07-06T11:56:27Z

My code works fine on a single GPU. However, when I switch to multi-GPU with DDP, I get the following error:

Traceback (most recent call last):
  File "stl_bert_trial_lightning.py", line 276, in <module>
    trainer.fit(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 968, in fit
    self.spawn_ddp_children(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 449, in spawn_ddp_children
    self.ddp_train(local_rank, model, is_master=True)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 538, in ddp_train
    self.run_pretrain_routine(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 1139, in run_pretrain_routine
    False)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 347, in _evaluate
    self.reduce_eval_ddp(eval_results)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 366, in reduce_eval_ddp
    dist.all_reduce(v, op=dist.reduce_op.SUM)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 898, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: Tensors must be CUDA and dense

Any help is appreciated.
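
The failing call in the traceback above is dist.all_reduce, and the error message says it expects dense CUDA tensors. A minimal diagnostic sketch follows, assuming the values being reduced live in a flat dict similar to the eval_results seen in the traceback. The helper name find_unreducible is hypothetical and not part of Lightning.

```python
import torch


def find_unreducible(results):
    """Return {key: reason} for values that are not dense CUDA tensors.

    `results` is assumed to be a flat dict of metric name -> value.
    """
    problems = {}
    for key, value in results.items():
        if not isinstance(value, torch.Tensor):
            problems[key] = f"not a tensor ({type(value).__name__})"
        elif not value.is_cuda:
            problems[key] = f"on {value.device}, expected a CUDA device"
        elif value.layout != torch.strided:
            problems[key] = f"layout {value.layout}, expected dense (strided)"
    return problems


if __name__ == "__main__":
    # Both entries are flagged: one is a CPU tensor, the other a Python float.
    example = {"val_loss": torch.tensor(0.25), "val_acc": 0.9}
    for name, reason in find_unreducible(example).items():
        print(f"{name}: {reason}")
```

Calling this on the dict returned from a validation or test hook, just before returning it, should point at the entry that triggers the error.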

jmarsil · 2020-07-06T20:38:00Z

I got the exact same error with the newest release of PTL. If you move back down to version 0.8.1, it should work (pip install pytorch-lightning==0.8.1). @hadarishav

0 replies

cmpute · 2020-07-08T02:34:44Z

I also got the same error. The all_reduce function requires all tensors to be on CUDA, so in my case I moved every output of validation_epoch_end to CUDA, and that solved the problem.
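
A minimal sketch of this kind of fix, assuming the hook returns a flat dict of metrics. The helper to_device_tensors is a hypothetical name, not a Lightning API.

```python
import torch


def to_device_tensors(outputs, device):
    """Convert every value in a flat dict to a tensor on `device`.

    Python numbers become 0-dim tensors; existing tensors are moved
    (nothing is copied if they are already on `device`).
    """
    return {
        key: torch.as_tensor(value, device=device)
        for key, value in outputs.items()
    }


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    logs = {"val_loss": torch.tensor(0.25), "val_acc": 0.9}
    print(to_device_tensors(logs, device))
```

Inside a LightningModule, the call at the end of validation_epoch_end would look like return to_device_tensors(logs, self.device).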

1 reply

ParamsRaman Nov 23, 2021

@cmpute could you share the exact code you used to do this (i.e. to move the output of that hook to CUDA)?
I am facing the same error.

tmcclintock · 2020-08-26T00:44:39Z

I am also experiencing this issue on PL 0.8.5, python 3.7.8, torch 1.5.1.

With distributed_backend="ddp" and gpus=0, everything is fine. If I set gpus=[0] or gpus=[0, 1], I get the error posted above.

Was this fixed in PL 0.9 without the issue being closed? Neither the traceback the OP posted nor my own maps to lines in the current repo.

0 replies

velocityCavalry · 2020-09-05T02:38:05Z

I also had this issue, but in my case it was because I wasn't passing device=x.device when calculating the accuracy.

1 reply

ParamsRaman Nov 23, 2021

@velocityCavalry could you share a code snippet of the above fix if you have one? I am facing a similar problem and want to check whether your fix solves my issue too.

awaelchli · 2020-09-20T00:05:03Z

@velocityCavalry could you share the code snippet showing how you calculated the accuracy? I am wondering which of your tensors weren't already on the GPU. The outputs that go into validation_epoch_end should be on the right device. However, if you create a new tensor and return it from validation_epoch_end, you need to make sure it is on self.device.
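
As a hedged illustration of the point about newly created tensors: a tensor built with torch.tensor(...) or torch.zeros(...) lands on the CPU unless a device is given. The accuracy function below is a hypothetical example, not the code from this thread. It creates its accumulator on the same device as its input.

```python
import torch


def accuracy(logits, target):
    """Fraction of correct predictions, computed on the device of `logits`."""
    preds = logits.argmax(dim=1)
    # Create the accumulator on the input's device. Without `device=...`
    # it would be a CPU tensor even when `logits` is on the GPU.
    correct = torch.zeros((), device=logits.device)
    correct += (preds == target).sum()
    return correct / target.numel()


if __name__ == "__main__":
    logits = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
    target = torch.tensor([1, 0, 0])
    result = accuracy(logits, target)
    print(result, result.device)
```

Inside a LightningModule hook, where there may be no input tensor to borrow the device from, device=self.device plays the same role.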

0 replies

Vozf · 2020-11-06T14:43:05Z

Same with pl 1.0.5

0 replies

pableeto · 2020-11-12T06:45:40Z

I've faced the same "Tensors must be CUDA and dense" issue with pl 1.0.5. @williamFalcon

Here is my custom metric class:


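# Metric class (imports assume the PL 1.0.x module layout).
from typing import Dict

import torch
from pytorch_lightning.metrics import Metric
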
class ForegroundIOU(Metric):

    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)

        self.add_state("n_image", default=torch.tensor(0.), dist_reduce_fx="sum")
        self.add_state("iou_sum", default=torch.tensor(0.), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        assert preds.shape == target.shape
        # Note: Input should be NCHW or NTCHW format.
        if(len(preds.shape) == 5):
            # flatten NT dim.
            preds = torch.flatten(preds, end_dim=1)
            target = torch.flatten(target, end_dim=1)

        self.n_image = self.n_image + preds.shape[0]
        self.iou_sum = self.iou_sum + torch.sum(self.f_iou_metric(preds, target))

    def compute(self):
        val = self.iou_sum.float() / self.n_image.float()
        return val

    @staticmethod
    def f_iou_metric(mask: torch.FloatTensor, mask_gt: torch.FloatTensor, soft_param=1.0, threshold=0.5):
        mv = (mask > threshold).view(mask.shape[0], -1)
        gv = (mask_gt > threshold).view(mask_gt.shape[0], -1)
        tp = torch.sum(mv & gv, dim=1) + soft_param
        fp = torch.sum(mv & torch.logical_not(gv), dim=1)
        fn = torch.sum(torch.logical_not(mv) & gv, dim=1)
        return tp.float() / (tp + fp + fn).float()

And in my test loop:

def test_step(self, batch: Dict, batch_idx: int):
    # ... other code for computing results ...
    self.iou_metric(y_hat_sigmoid, y)  # iou_metric is an instance of the ForegroundIOU class defined above.
    self.log('final_iou', self.iou_metric)

When I run this code with the DDP backend, the "Tensors must be CUDA and dense" error occurs at the end of testing.

0 replies

sustcsonglin · 2021-03-11T07:19:12Z

same

3 replies

sustcsonglin Apr 8, 2021

metric.to(self.device) works for me

ParamsRaman Nov 23, 2021

@sustcsonglin do you mind sharing a code snippet showing where you set this device? I would like to understand this better. I am facing the same error.

WiGig11 Mar 11, 2024

metric.to(self.device) also works for me, magically! However, I still wonder why this error occurs. The metric is built, for example, with an MSE loss (input, pred), and since input and pred are both on CUDA, the result should be on CUDA as well. Can you help me? Thank you very much!
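
One possible explanation, offered as an assumption rather than a diagnosis of any specific case above: Module.to(...) only converts parameters, registered buffers, and submodules. A tensor kept as a plain attribute, or a metric kept inside a plain Python list or dict, is not touched and stays where it was created, which is the CPU by default. The sketch below uses a dtype change as a stand-in for a device move so that it runs without a GPU. Holder is a hypothetical class.

```python
import torch
from torch import nn


class Holder(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffer: follows the module through .to() / .cuda().
        self.register_buffer("tracked", torch.zeros(1))
        # Plain tensor attribute: invisible to .to() / .cuda().
        self.untracked = torch.zeros(1)


if __name__ == "__main__":
    holder = Holder().to(torch.float64)
    print(holder.tracked.dtype)    # converted together with the module
    print(holder.untracked.dtype)  # left unchanged
```

The same split applies to holder.to("cuda"): only tracked would move. Calling metric.to(self.device) explicitly sidesteps the question of whether the metric was registered in a way that the parent module can see.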

carmocca · 2021-03-11T14:41:20Z

Hi everybody! If one of you could provide a minimal script to reproduce this, that'd be great.

Then open a bug report in the issues tab with it. Discussion is better suited to... discuss :D

2 replies

ParamsRaman Nov 23, 2021

@carmocca please let me know if you were able to get a test script that reproduces the error, along with the probable fixes that folks have tried here. I am facing the same error and want to know exactly how the problematic tensor is handled to resolve the problem (an exact code snippet).

LukeLIN-web Nov 9, 2022

I fail at pytorch/pytorch#88685

cstsunfu · 2022-10-28T08:31:53Z

Got the same error on pl==1.6.0.

0 replies

LukeLIN-web · 2022-11-09T07:57:18Z

same bug at pytorch/pytorch#88685

0 replies

karthi0804 · 2023-07-08T10:53:01Z

Using torch==2.0.1, torchmetrics==1.0.0, pytorch_lightning==2.0.4, and deepspeed==0.9.5, I was still getting the error. The metric object was not moved to the respective CUDA device when the metric was initialized in __init__ of the pl.LightningModule. I managed to fix the issue by forcing metric = metric.to(self.device) in validation_step.
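
A minimal sketch of that workaround, using plain torch.nn.Module stand-ins so that it runs without Lightning, torchmetrics, or DeepSpeed. RunningMean and Model are hypothetical. In a real LightningModule the target would be self.device, which Lightning provides, rather than the device of the batch.

```python
import torch
from torch import nn


class RunningMean(nn.Module):
    """Stand-in for a metric: keeps its state in registered buffers."""

    def __init__(self):
        super().__init__()
        self.register_buffer("total", torch.tensor(0.0))
        self.register_buffer("count", torch.tensor(0.0))

    def update(self, values):
        self.total += values.sum()
        self.count += values.numel()

    def compute(self):
        return self.total / self.count


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.metric = RunningMean()

    def validation_step(self, batch):
        # Force the metric state onto the batch's device before updating.
        # If the state is already there, nothing is copied.
        self.metric = self.metric.to(batch.device)
        self.metric.update(batch)
        return self.metric.compute()


if __name__ == "__main__":
    model = Model()
    print(model.validation_step(torch.tensor([1.0, 2.0, 3.0])))
```

Moving the metric inside the step costs little and guarantees that its state sits on the same device as the data by the time any cross-process reduction runs.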

1 reply

awaelchli Jul 8, 2023

@karthi0804 You are using DeepSpeed stage 3 training, right? See my explanation here: #17748 (comment)
If you move your layer definition into the configure_sharded_model() hook in the LightningModule, it should work with the latest DeepSpeed version.

fengye1966 · 2023-09-15T06:38:45Z

I ran into the same problem. If your tensors are on the GPU, make sure the config sets collect_device='cuda'. Otherwise, the collect device should be 'cpu'.

0 replies
